layout: default title: Properties nav_order: 2 parent: Chars and Strings

Properties

Overview

Text processing requires that a program treat text appropriately. If text is exchanged between several systems, it is important for them to process the text consistently. This is done by assigning each character, or a range of characters, attributes or properties used for text processing, and by defining standard algorithms for at least the basic text operations.

Traditionally, such attributes and algorithms have not been well-defined for most character sets, and text processing had to rely on ad-hoc solutions. Over time, standards were created for querying properties of the system codepage. However, the set of these properties was limited. Their data was not coordinated among implementations, and standard algorithms were not available.

It is one of the strengths of Unicode that it not only defines a very large character set, but also assigns a comprehensive set of properties and usage notes to all characters. It defines standard algorithms for critical text processing, and the data is publicly provided and kept up-to-date. See https://www.unicode.org/ and https://www.unicode.org/main.html for more information.

Sample code is available in the ICU source code library at icu4c/source/samples/props/props.cpp. See also the source code for the Unicode browser demo application, which can be used online to browse Unicode characters with their properties.

Unicode Character Database properties in ICU APIs

The following table shows all Unicode Character Database properties (except for purely “extracted” ones and Unihan properties) and the corresponding ICU APIs. Most of the time, ICU4C provides functions in icu4c/source/common/unicode/uchar.h and ICU4J provides parallel functions in the com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in Java).

Surrogate code points mostly have default property values, except for the General_Category (gc=Cs).

For integer values outside the Unicode code point range (negative or ≥ 0x110000), most API functions return null values (false, 0, etc.). API functions that map a code point to another (e.g., u_foldCase()/UCharacter.foldCase()) normally return out-of-range values (i.e., map them to themselves), just like for unassigned code points or generally code points that have no specific mappings. In particular, -1 (=U_SENTINEL in ICU4C) is mapped to -1.

Most properties are also available via UnicodeSet APIs and patterns. See the Lookup section below.

See UAX #44, Unicode Character Database itself for comparison. The UCD files PropertyAliases.txt and PropertyValueAliases.txt list all properties and their values by name and type.

UAX #44 also shows which UCD files have data for which properties, and many other useful details.

Most properties that use binary, integer, or enumerated values are available via functions u_hasBinaryProperty and u_getIntPropertyValue which take UProperty enum constants to select the property. (ICU4J UCharacter member functions do not have the “u_” prefix.) The constant names include the long property name according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property value enum constant names often contain the short property name and the long value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the enumeration result type is also listed here.

Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and UnicodeSet and regular expression patterns use the long or short property aliases and property value aliases (see PropertyAliases.txt and PropertyValueAliases.txt).

There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK for which the APIs do not use a single value but a bit-set (a mask) of zero or more values, with each bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to represent property value aliases for multiple general categories, like “Letters” (which stands for “Uppercase Letters”, “Lowercase Letters”, etc.). In other words, there are two ICU properties for the same Unicode property, one delivering single values (for per-code point lookup) and the other delivering sets of values (for use with value aliases and UnicodeSet).

UCD NameTypeICU4C uchar.h / ICU4J UCharacter
AgeUnicode version(U)C: u_charAge fills in UVersionInfo
Java: getAge returns a VersionInfo reference
Alphabeticbinary(U)u_isUAlphabetic, UCHAR_ALPHABETIC
ASCII_Hex_Digitbinary(U)UCHAR_ASCII_HEX_DIGIT
Basic_Emoji*binary(U)UCHAR_BASIC_EMOJI
Bidi_Classenum(U)u_charDirection, UCHAR_BIDI_CLASS
returns enum UCharDirection
Bidi_Controlbinary(U)UCHAR_BIDI_CONTROL
Bidi_Mirroredbinary(U)u_isMirrored, UCHAR_BIDI_MIRRORED
Bidi_Mirroring_Glyphcode pointu_charMirror
Bidi_Paired_Bracket_Typeenum(U)UCHAR_BIDI_PAIRED_BRACKET_TYPE
returns enum UBidiPairedBracketType
Blockenum(U)ublock_getCode, UCHAR_BLOCK
returns enum UBlockCode
Canonical_Combining_Class0..255(U)u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS
Case_FoldingUnicode stringu_strFoldCase (ustring.h)
Case_Ignorablebinary(U)UCHAR_CASE_IGNORABLE
Casedbinary(U)UCHAR_CASED
Changes_When_Casefoldedbinary(U)UCHAR_CHANGES_WHEN_CASEFOLDED
Changes_When_Casemappedbinary(U)UCHAR_CHANGES_WHEN_CASEMAPPED
Changes_When_NFKC_Casefoldedbinary(U)UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED
Changes_When_Lowercasedbinary(U)UCHAR_CHANGES_WHEN_LOWERCASED
Changes_When_Titlecasedbinary(U)UCHAR_CHANGES_WHEN_TITLECASED
Changes_When_Uppercasedbinary(U)UCHAR_CHANGES_WHEN_UPPERCASED
Composition_Exclusionbinary(c)contributes to Full_Composition_Exclusion
Dashbinary(U)UCHAR_DASH
Decomposition_MappingUnicode stringNFKC Normalizer2::getRawDecomposition()
Decomposition_Typeenum(U)UCHAR_DECOMPOSITION_TYPE
returns enum UDecompositionType
Default_Ignorable_Code_Pointbinary(U)UCHAR_DEFAULT​_IGNORABLE_CODE_POINT
Deprecatedbinary(U)UCHAR_DEPRECATED
Diacriticbinary(U)UCHAR_DIACRITIC
East_Asian_Widthenum(U)UCHAR_EAST_ASIAN_WIDTH
returns enum UEastAsianWidth
Emojibinary(U)UCHAR_EMOJI
Emoji_Componentbinary(U)UCHAR_EMOJI_COMPONENT
Emoji_Keycap_Sequence*binary(U)UCHAR_EMOJI_KEYCAP_SEQUENCE
Emoji_Modifierbinary(U)UCHAR_EMOJI_MODIFIER
Emoji_Modifier_Basebinary(U)UCHAR_EMOJI_MODIFIER_BASE
Emoji_Presentationbinary(U)UCHAR_EMOJI_PRESENTATION
Expands_On_NF*binaryavailable via normalization API (normalizer2.h)
Extended_Pictographicbinary(U)UCHAR_EXTENDED_PICTOGRAPHIC
Extenderbinary(U)UCHAR_EXTENDER
FC_NFKC_ClosureUnicode stringu_getFC_NFKC_Closure
Full_Composition_Exclusionbinary(U)UCHAR_FULL​_COMPOSITION_EXCLUSION
General_Categoryenum(U)u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK
returns enum UCharCategory
Grapheme_Basebinary(U)UCHAR_GRAPHEME_BASE
Grapheme_Cluster_Breakenum(U)UCHAR_GRAPHEME_CLUSTER_BREAK
returns enum UGraphemeClusterBreak
Grapheme_Extendbinary(U)UCHAR_GRAPHEME_EXTEND
Grapheme_Linkbinary(U)UCHAR_GRAPHEME_LINK
Hangul_Syllable_Typeenum(U)UCHAR_HANGUL_SYLLABLE_TYPE
returns enum UHangulSyllableType
Hex_Digitbinary(U)UCHAR_HEX_DIGIT
Hyphenbinary(U)UCHAR_HYPHEN
ID_Continuebinary(U)UCHAR_ID_CONTINUE
ID_Startbinary(U)UCHAR_ID_START
Ideographicbinary(U)UCHAR_IDEOGRAPHIC
IDS_Binary_Operatorbinary(U)UCHAR_IDS_BINARY_OPERATOR
IDS_Triary_Operatorbinary(U)UCHAR_IDS_TRINARY_OPERATOR
Indic_Positional_Categoryenum(U)UCHAR_INDIC_POSITIONAL_CATEGORY
returns enum UIndicPositionalCategory
Indic_Syllabic_Categoryenum(U)UCHAR_INDIC_SYLLABIC_CATEGORY
returns enum UIndicSyllabicCategory
ISO_CommentASCII stringu_getISOComment
Jamo_Short_NameASCII string(c)contributes to Name
Join_Controlbinary(U)UCHAR_JOIN_CONTROL
Joining_Groupenum(U)UCHAR_JOINING_GROUP
returns enum UJoiningGroup
Joining_Typeenum(U)UCHAR_JOINING_TYPE
returns enum UJoiningType
Line_Breakenum(U)UCHAR_LINE_BREAK
returns enum ULineBreak
Logical_Order_Exceptionbinary(U)UCHAR_LOGICAL_ORDER_EXCEPTION
Lowercasebinary(U)u_isULowercase, UCHAR_LOWERCASE
Lowercase_MappingUnicode stringavailable via u_strToLower (ustring.h)
Mathbinary(U)UCHAR_MATH
NameASCII string(U)u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME)
Name_AliasASCII stringu_charName(U_CHAR_NAME_ALIAS)
NF*_QuickCheckenum(U)UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h)
returns UNormalizationCheckResult (no/maybe/yes)
NFKC_CasefoldUnicode stringavailable via normalization API (normalizer2.h “nfkc_cf”)
Noncharacter_Code_Pointbinary(U)UCHAR_NONCHARACTER​_CODE_POINT,
U_IS_UNICODE_NONCHAR (utf.h)
Numeric_Typeenum(U)UCHAR_NUMERIC_TYPE
returns enum UNumericType
Numeric_Valuedouble(U)u_getNumericValueJava/UnicodeSet: only non-negative integers, no fractions
Other_Alphabeticbinary(c)contributes to Alphabetic
Other_Default_Ignorable​_Code_Pointbinary(c)contributes to Default_Ignorable​_Code_Point
Other_Grapheme_Extendbinary(c)contributes to Grapheme_Extend
Other_Lowercasebinary(c)contributes to Lowercase
Other_Mathbinary(c)contributes to Math
Other_Uppercasebinary(c)contributes to Uppercase
Pattern_Syntaxbinary(U)UCHAR_PATTERN_SYNTAX
Pattern_White_Spacebinary(U)UCHAR_PATTERN_WHITE_SPACE
Prepended_Concatenation_Markbinary(U)UCHAR_PREPENDED_CONCATENATION_MARK
Quotation_Markbinary(U)UCHAR_QUOTATION_MARK
Radicalbinary(U)UCHAR_RADICAL
Regional_Indicatorbinary(U)UCHAR_REGIONAL_INDICATOR
RGI_Emoji*binary(U)UCHAR_RGI_EMOJI
RGI_Emoji_Flag_Sequence*binary(U)UCHAR_RGI_EMOJI_FLAG_SEQUENCE
RGI_Emoji_Modifier_Sequence*binary(U)UCHAR_RGI_EMOJI_MODIFIER_SEQUENCE
RGI_Emoji_Tag_Sequence*binary(U)UCHAR_RGI_EMOJI_TAG_SEQUENCE
RGI_Emoji_ZWJ_Sequence*binary(U)UCHAR_RGI_EMOJI_ZWJ_SEQUENCE
Scriptenum(U)uscript_getCode (uscript.h), UCHAR_SCRIPT
returns enum UScriptCode
Script_Extensionslist(U)uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS
returns a list of enum UScriptCode values
Sentence_Breakenum(U)UCHAR_SENTENCE_BREAK
returns enum USentenceBreak
Simple_Case_Foldingcode pointu_foldCase
Simple_Lowercase_ Mappingcode pointu_tolower
Simple_Titlecase_ Mappingcode pointu_totitle
Simple_Uppercase_ Mappingcode pointu_toupper
Soft_Dottedbinary(U)UCHAR_SOFT_DOTTED
STermbinary(U)UCHAR_S_TERM
Terminal_Punctuationbinary(U)UCHAR_TERMINAL_PUNCTUATION
Titlecase_MappingUnicode stringu_strToTitle (ustring.h)
Unicode_1_NameASCII string(U)u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME)
Unified_Ideographbinary(U)UCHAR_UNIFIED_IDEOGRAPH
Uppercasebinary(U)u_isUUppercase, UCHAR_UPPERCASE
Uppercase_MappingUnicode stringu_strToUpper (ustring.h)
Vertical_Orientationenum(U)UCHAR_VERTICAL_ORIENTATION
returns enum UVerticalOrientation
White_Spacebinary(U)u_isUWhiteSpace, UCHAR_WHITE_SPACE
Word_Breakenum(U)UCHAR_WORD_BREAK
returns enum UWordBreakValues
XID_Continuebinary(U)UCHAR_XID_CONTINUE
XID_Startbinary(U)UCHAR_XID_START

Notes:

  1. (c) - This property only contributes to “real” properties (mostly “Other_...” properties), so there is no direct support for this property in ICU.

  2. (U) - This property is available via the UnicodeSet APIs and patterns. Any property available in UnicodeSet is also available in regular expressions. Properties which are not available in UnicodeSet are generally those that are not available through a UProperty selector.

  3. When a property name is followed by a star (*), it is a property of strings; for example, Basic_Emoji and RGI_Emoji. See https://www.unicode.org/reports/tr51/#Emoji_Sets Properties of strings are not yet supported in ICU regular expressions.

  4. UnicodeSet [:scx=Arab:] is a superset of [:sc=Arab:]; see https://www.unicode.org/reports/tr18/#Script_Property

  5. Full case mapping properties (e.g., Lowercase_Mapping) are complex. The string case mapping functions that implement them handle language-specific and/or context-sensitive mappings. The output may have more code points or fewer code points than the input.

Customization

ICU does not provide the means to modify properties at runtime. The properties are provided exactly as specified by a recent version of the Unicode Standard (as published in the Character Database).

For custom sets and maps, it is easiest to make UnicodeSet or UCPTrie/CodePointTrie objects with the desired values.

However, if an application requires custom properties (for example, for Private Use characters), then it is possible to change or add them at build-time. This is doable but not easy.

It is done by modifying the Character Database files copied into the ICU source tree at icu4c/source/data/unidata. Since ICU 49, most of the properties have been combined into one file, unidata/ppucd.txt (see the Preparsed UCD design doc). Some of the remaining UCD files are still inputs, others are only used for unit tests.

To add a character to such a file, a line must be inserted into the file with the format used in that file (see the online documentation on the Unicode site for more information). After modifying one or more of these files, the ICU data needs to be rebuilt, and the resulting files need to be checked into the ICU source tree. The files are processed by special ICU tools outside of the normal ICU build. The unidata/changes.txt file documents the process that has been used for the last several Unicode version updates; skip the file preparation and API update steps.

Any available Unicode code point (0 to 10FFFF16) can be used. Code point values should be written with either 4, 5, or 6 hex digits. The minimum number of digits possible should be used (but no fewer than 4). Note that the Unicode Standard specifies that the 32 code points U+FDD0..U+FDEF and the 34 code points U+...xFFFE and U+...xFFFF (where x=0, 1, 2, ..., F, 10) are not characters, therefore they should not be added to any of the character database files.

Lookup

For lookup by code point, iterate through the string, fetch code points, and either call the unicode/uchar.h / UCharacter or similar functions, or use dedicated sets and maps. For binary properties, and sets in general, there are also more efficient methods for iterating over substrings.

Binary property from code point

Call one of the binary-property functions. Alternatively, make a UnicodeSet for the property (remember to freeze() it) or for a custom set of characters, and call contains().

Binary property over string

It is often useful to partition a string into substrings where every character has the property, and substrings where every character does not have the property. For example, to split the string at separator characters, remove certain types of characters, trim white space, etc. Use a UnicodeSet with its span() and spanBack() methods (available in C++ in UTF-8 versions). In Java, you can also use a UnicodeSetSpanner.

Enumerated property from code point

Call one of the int-property functions. Alternatively, build a UCPTrie / CodePointTrie (new in ICU 63) via its mutable version and build method, then use that to get the int value for each code point.

Enumerated property over string

Easiest is to iterate over code points of the string and call per-code point lookup methods (or use a code point trie).

The UCPTrie / CodePointTrie (new in ICU 63) also offers C macros and a Java String iterator class where the iteration and data lookup are integrated to avoid redundancies in validation and range checks.

The UTF-16 code point macros and the Java String iterator also provide the code point as output, because it has to be fetched or assembled anyway.

The UTF-8 macros do not assemble the code point because that would be some amount of extra work, but often only the lookup value is used and the code point is not needed. When it is needed after all, it is possible to take advantage of the macros having validated the byte sequence: If the sequence was ill-formed, then the trie's error value is set. Therefore, if a value other than the trie error value was returned, then the sequence was well-formed, and the code point can be fetched without revalidating the sequence (e.g., via U8_NEXT_UNSAFE()). Since the length of the sequence (1..4 bytes) is also known from the iteration (string index before/after next() call), an even simpler piece of code can be used. (See for example the ICU-internal function codePointFromValidUTF8() in normalizer2impl.cpp.)

Code point trie most-optimized UTF-16 access

UTF-16 text processing can be further optimized by detecting surrogate pairs and assembling supplementary code points only when there is non-trivial data available.

At build time, iterate over all supplementary code points (umutablecptrie_getRange() / MutableCodePointTrie.getRange() starting from U+10000) to see if there is non-trivial data for any of the supplementary code points associated with a lead surrogate. If so, then set a special (application-specific) value for the lead surrogate.

At runtime, use UCPTRIE_FAST_BMP_GET() per code unit. If there is non-trivial data and the code unit is a lead surrogate, then check if a trail surrogate follows. If so, assemble the supplementary code point with U16_GET_SUPPLEMENTARY() and look up its value with UCPTRIE_FAST_SUPP_GET(); otherwise deal with the unpaired surrogate in some way. (Java CodePointTrie.Fast and java.lang.Character have equivalent methods.)

If there is only trivial data for lead and trail surrogates, then processing can often skip them. (In this case, there will be two data lookups, one for the lead surrogate and one for the trail surrogate, but they are fast, and this optimization speeds up the more common BMP characters by not checking for surrogates each time.)

For example, in normalization or case mapping all characters that do not have any mappings are simply copied as is.

Properties in ICU Rule Syntax

ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic “spaces” to allow for the usage of white space characters outside of the normal ASCII range while still maintaining backward compatibility. See https://www.unicode.org/reports/tr31/#Pattern_Syntax for more information.