| # UnicodeSet |
| |
| ## Overview |
| |
| A UnicodeSet is an object that represents a set of Unicode characters or |
| character strings. The contents of that object can be specified either by |
| patterns or by building them programmatically. |
| |
| Here are a few examples of sets: |
| |
| | Pattern | Description | |
| |--------------|-------------------------------------------------------------| |
| | [a-z]        | The lowercase letters a through z                           | |
| | [abc123]     | The six characters a, b, c, 1, 2, and 3                     | |
| | [\p{Letter}] | All characters with the Unicode General Category of Letter. | |
| |
| **String Values:** In addition to being a set of characters (that is, of Unicode |
| code points), a UnicodeSet may also contain string values. Conceptually, a |
| UnicodeSet is always a set of strings, not a set of characters; in many common |
| use cases the strings are all of length one, which reduces to a set of |
| characters. |
| |
| This concept can be confusing when first encountered, probably because similar |
| set constructs in other environments (such as regular expressions) can contain |
| only characters. |
| |
| ## UnicodeSet Patterns |
| |
| Patterns are a series of characters bounded by square brackets that contain |
| lists of characters and Unicode property sets. A list is a sequence of |
| characters that may include ranges, indicated by a '-' between two characters, |
| as in "a-z". A range specifies all characters from the left character to the |
| right character, in Unicode order. For example, \[a c d-f m\] is equivalent to |
| \[a c d e f m\]. Whitespace can be used freely for clarity: \[a c d-f m\] means |
| the same as \[acd-fm\]. |
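
The range and whitespace rules above can be modeled in a few lines. The following is only a sketch in plain Java, not ICU's parser: the class `SimpleListSketch` and its `expand` helper are hypothetical names, and the code assumes a well-formed list of literal BMP characters with no properties, escapes, quotes, or nested sets.

```java
import java.util.TreeSet;

class SimpleListSketch {
    /**
     * Expands the inside of a simple list such as "a c d-f m" into its members.
     * Assumption: only literal characters, '-' ranges, and whitespace occur.
     */
    static TreeSet<Character> expand(String list) {
        // Whitespace is not significant, so drop it before scanning.
        String s = list.replaceAll("\\s+", "");
        TreeSet<Character> result = new TreeSet<>();
        int i = 0;
        while (i < s.length()) {
            char start = s.charAt(i);
            boolean isRange = i + 2 < s.length() && s.charAt(i + 1) == '-';
            if (isRange) {
                char end = s.charAt(i + 2);
                if (end < start) {
                    throw new IllegalArgumentException("Inverted range: " + start + "-" + end);
                }
                // Add every character from start to end, inclusive.
                for (int c = start; c <= end; c++) {
                    result.add((char) c);
                }
                i += 3;
            } else {
                result.add(start);
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("a c d-f m")); // [a, c, d, e, f, m]
        System.out.println(expand("acd-fm"));    // [a, c, d, e, f, m]
    }
}
```

Both calls produce the same six members, which mirrors the equivalence of \[a c d-f m\] and \[acd-fm\] described above.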
| |
| Unicode property sets are specified by a Unicode property, such as \[:Letter:\]. |
| For a list of supported properties, see the [Properties](properties.md) chapter. |
| For details on the use of short vs. long property and property value names, see |
| the end of this section. The syntax for specifying the property names is an |
| extension of either POSIX or Perl syntax with the addition of "=value". For |
| example, you can match letters by using the POSIX syntax \[:Letter:\], or by |
| using the Perl-style syntax \\p{Letter}. The type can be omitted for the |
| Category and Script properties, but is required for other properties. |
| |
| The table below shows the two kinds of syntax, POSIX style and Perl style. It |
| also shows the "Negative" form of each, which matches all characters that do |
| not have the given property value. For example, \[:^Letter:\] matches all |
| characters that are not in \[:Letter:\]. |
| |
| | | Positive | Negative | |
| |--------------------|----------------|-----------------| |
| | POSIX-style Syntax | [:type=value:] | [:^type=value:] | |
| | Perl-style Syntax | \p{type=value} | \P{type=value} | |
| |
| These low-level lists and properties can then be freely combined with the |
| normal set operations (union, inverse, difference, and intersection): |
| |
| | Operation | Example | Corresponding Method | Meaning | |
| |-----------|-------------------------|----------------------|---------| |
| | A B | [[:letter:] [:number:]] | A.addAll(B) | To union two sets A and B, simply concatenate them. | |
| | A & B | [[:letter:] & [a-z]] | A.retainAll(B) | To intersect two sets A and B, use the '&' operator. | |
| | A - B | [[:letter:] - [a-z]] | A.removeAll(B) | To take the set-difference of two sets A and B, use the '-' operator. | |
| | [^A] | [^a-z] | A.complement() | To invert a set A, place a '^' immediately after the opening '['. Note that the complement only affects code points, not string values. In any other location, the '^' does not have a special meaning. | |
| |
| ### Precedence |
| |
| The binary operators of union, intersection, and set-difference have equal |
| precedence and bind left-to-right. Thus the following are equivalent: |
| |
| * [[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]] |
| * [[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]] |
| |
| Another example is that the set \[\[ace\]\[bdf\] - \[abc\]\[def\]\] is **not** |
| the empty set, but instead the set \[def\]. That is because the syntax |
| corresponds to the following UnicodeSet operations: |
| |
| 1. start with \[ace\] |
| 2. addAll \[bdf\] *-- we now have \[abcdef\]* |
| 3. removeAll \[abc\] *-- we now have \[def\]* |
| 4. addAll \[def\] *-- no effect, we still have \[def\]* |
| |
| Grouping only matters when difference or intersection operations are |
| involved; a sequence of unions alone gives the same result however it is |
| grouped. To make the '-' the main operator, add brackets to group the |
| operations as desired, such as \[\[ace\]\[bdf\] - \[\[abc\]\[def\]\]\]. |
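
The left-to-right evaluation can be traced with ordinary Java collections, whose `addAll` and `removeAll` methods happen to share names with the UnicodeSet methods listed earlier. This is a sketch that uses `java.util.TreeSet` as a stand-in for UnicodeSet so that it runs without ICU4J; the class `PrecedenceSketch` and its helpers are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

class PrecedenceSketch {
    /** Builds a set from the characters of a string, e.g. "ace" gives {a, c, e}. */
    static Set<Character> setOf(String chars) {
        Set<Character> s = new TreeSet<>();
        for (char c : chars.toCharArray()) {
            s.add(c);
        }
        return s;
    }

    /** Models [[ace][bdf] - [abc][def]], applied strictly left to right. */
    static Set<Character> ungrouped() {
        Set<Character> r = setOf("ace");
        r.addAll(setOf("bdf"));    // now [abcdef]
        r.removeAll(setOf("abc")); // now [def]
        r.addAll(setOf("def"));    // no effect, still [def]
        return r;
    }

    /** Models [[ace][bdf] - [[abc][def]]], where the right-hand side is grouped first. */
    static Set<Character> grouped() {
        Set<Character> left = setOf("ace");
        left.addAll(setOf("bdf"));  // [abcdef]
        Set<Character> right = setOf("abc");
        right.addAll(setOf("def")); // [abcdef]
        left.removeAll(right);      // empty
        return left;
    }

    public static void main(String[] args) {
        System.out.println(ungrouped()); // [d, e, f]
        System.out.println(grouped());   // []
    }
}
```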
| |
| Another caveat with the '&' and '-' operators is that they operate between |
| **sets**. That is, they must be immediately preceded and immediately followed by |
| a set. For example, the pattern \[\[:Lu:\]-A\] is illegal, since it is |
| interpreted as the set \[:Lu:\] followed by the incomplete range -A. To specify |
| the set of uppercase letters except for 'A', enclose the 'A' in a set: |
| \[\[:Lu:\]-\[A\]\]. |
| |
| ### Examples |
| |
| | Pattern          | Description | |
| |------------------|-------------| |
| | [a]              | The set containing 'a' | |
| | [a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order | |
| | [^a-z]           | The set containing all code points except 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF | |
| | [[pat1][pat2]] | The union of sets specified by pat1 and pat2 | |
| | [[pat1]& [pat2]] | The intersection of sets specified by pat1 and pat2 | |
| | [[pat1]- [pat2]] | The asymmetric difference of sets specified by pat1 and pat2 | |
| | [:Lu:] | The set of characters belonging to the given Unicode category, as defined by Character.getType(); in this case, Unicode uppercase letters. The long form for this is [:UppercaseLetter:]. | |
| | [:L:] | The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:Letter:]. | |
| |
| ### String Values in Sets |
| |
| String values are enclosed in {curly brackets}. |
| |
| | Set expression | Description | |
| |------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| | [abc{def}] | A set containing four members, the single characters a, b and c, and the string “def” | |
| | [{abc}{def}] | A set containing two members, the string “abc” and the string “def”. | |
| | [{a}{b}{c}] and [abc] | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. | |
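
One way to see why a UnicodeSet is conceptually a set of strings is to model each member as a `String`. The sketch below is not ICU code: `StringMembersSketch` and `members` are hypothetical names, the input is the text between the outer brackets, and only literal characters and {curly-bracket} strings are handled.

```java
import java.util.Set;
import java.util.TreeSet;

class StringMembersSketch {
    /**
     * Splits the body of a simple set expression into its members.
     * Assumption: the body contains only literal characters and {strings}.
     */
    static Set<String> members(String body) {
        Set<String> out = new TreeSet<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '{') {
                int close = body.indexOf('}', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed '{' in: " + body);
                }
                // Everything between the curly brackets is one member.
                out.add(body.substring(i + 1, close));
                i = close + 1;
            } else {
                // A bare character is a member of length one.
                out.add(String.valueOf(c));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(members("abc{def}"));   // [a, b, c, def]
        System.out.println(members("{abc}{def}")); // [abc, def]
        System.out.println(members("{a}{b}{c}").equals(members("abc"))); // true
    }
}
```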
| |
| ### Character Quoting and Escaping in Unicode Set Patterns |
| |
| #### Single Quote |
| |
| Two adjacent single quotes represent a single quote, either inside or outside |
| single quotes. |
| |
| Text within single quotes is not interpreted in any way (except for two adjacent |
| single quotes). It is taken as literal text (special characters become |
| non-special). |
| |
| These quoting conventions for ICU UnicodeSets differ from those of regular |
| expression character set expressions. In regular expressions, single quotes have |
| no special meaning and are treated like any other literal character. |
| |
| #### Backslash Escapes |
| |
| Outside of single quotes, certain backslashed characters have special meaning: |
| |
| | Escape     | Meaning                                | |
| |------------|----------------------------------------| |
| | \uhhhh     | Exactly 4 hex digits; h in [0-9A-Fa-f] | |
| | \Uhhhhhhhh | Exactly 8 hex digits | |
| | \xhh | 1-2 hex digits | |
| | \ooo | 1-3 octal digits; o in [0-7] | |
| | \a | U+0007 (BELL) | |
| | \b | U+0008 (BACKSPACE) | |
| | \t | U+0009 (HORIZONTAL TAB) | |
| | \n | U+000A (LINE FEED) | |
| | \v | U+000B (VERTICAL TAB) | |
| | \f | U+000C (FORM FEED) | |
| | \r | U+000D (CARRIAGE RETURN) | |
| | \\ | U+005C (BACKSLASH) | |
| |
| Anything else following a backslash is mapped to itself, except in an |
| environment where it is defined to have some special meaning. For example, |
| \\p{Lu} is the set of uppercase letters in UnicodeSet. |
| |
| Any character formed as the result of a backslash escape loses any special |
| meaning and is treated as a literal. In particular, note that \\u and \\U |
| escapes create literal characters. (In contrast, the Java compiler treats |
| Unicode escapes as just a way to represent arbitrary characters in an ASCII |
| source file, and any resulting characters are **not** tagged as literals.) |
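
The digit counts in the table above are strict: \\u takes exactly 4 hex digits and \\U exactly 8. The following sketch decodes a single escape of either kind into a code point. It is an illustration in plain Java rather than ICU's implementation, and `EscapeSketch` and `decode` are hypothetical names.

```java
class EscapeSketch {
    /** Decodes one "\\uhhhh" or "\\Uhhhhhhhh" escape into a code point. */
    static int decode(String escape) {
        if (escape.length() < 2 || escape.charAt(0) != '\\') {
            throw new IllegalArgumentException("Not a backslash escape: " + escape);
        }
        int digits;
        switch (escape.charAt(1)) {
            case 'u': digits = 4; break; // exactly 4 hex digits
            case 'U': digits = 8; break; // exactly 8 hex digits
            default:
                throw new IllegalArgumentException("Unsupported escape: " + escape);
        }
        if (escape.length() != 2 + digits) {
            throw new IllegalArgumentException("Expected " + digits + " hex digits: " + escape);
        }
        int cp = 0;
        for (int i = 2; i < escape.length(); i++) {
            int d = Character.digit(escape.charAt(i), 16);
            if (d < 0) {
                throw new IllegalArgumentException("Bad hex digit in: " + escape);
            }
            cp = cp * 16 + d;
        }
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + escape);
        }
        return cp;
    }

    public static void main(String[] args) {
        // "\\u005B" in Java source is the six characters \u005B at run time.
        System.out.println(new String(Character.toChars(decode("\\u005B")))); // [
        System.out.println(Integer.toHexString(decode("\\U0001F600")));       // 1f600
    }
}
```

In a UnicodeSet pattern, the '[' produced by \\u005B would be a literal member of the set rather than the start of a nested set, as described above.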
| |
| #### Whitespace |
| |
| Whitespace (as defined by our API) is ignored unless it is quoted or |
| backslashed. |
| |
| *The rules for quoting and whitespace handling are common to most ICU APIs that |
| process rule or expression strings, including UnicodeSet, Transliteration, and |
| Break Iterators.* |
| |
| *ICU Regular Expression set expressions have a different (but similar) syntax |
| and a different set of recognized backslash escapes. \[Sets\] in ICU Regular |
| Expressions follow the conventions of Perl and Java regular expressions rather |
| than the pattern syntax of ICU UnicodeSet.* |
| |
| ## Using a UnicodeSet |
| |
| For best performance, once the set's contents are complete, freeze() the set to |
| make it immutable and to speed up contains() and span() operations (for which it |
| builds a small additional data structure). |
| |
| The most basic operation is contains(code point) or, if relevant, |
| contains(string). |
| |
| For splitting and partitioning strings, it is simpler and faster to use span() |
| and spanBack() than to iterate over code points and call contains(). In |
| Java, there is also a class UnicodeSetSpanner for somewhat higher-level |
| operations. See also the “Lookup” section of the [Properties](properties.md) |
| chapter. |
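
To make the idea of spanning concrete, the sketch below computes the end of the longest run of code points, starting at a given index, that all belong to a set. It is a simplified model in plain Java, not ICU's span(): the set is represented by a `java.util.function.IntPredicate`, string members and span conditions are ignored, and `SpanSketch` is a hypothetical name.

```java
import java.util.function.IntPredicate;

class SpanSketch {
    /**
     * Returns the index just past the longest run of code points, beginning at
     * start, that all satisfy inSet. Returns start if the first one does not.
     */
    static int span(CharSequence s, int start, IntPredicate inSet) {
        int i = start;
        while (i < s.length()) {
            int cp = Character.codePointAt(s, i);
            if (!inSet.test(cp)) {
                break;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        String text = "123abc";
        int end = span(text, 0, cp -> Character.isDigit(cp));
        System.out.println(text.substring(0, end)); // 123
        System.out.println(text.substring(end));    // abc
    }
}
```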
| |
| ## Programmatically Building UnicodeSets |
| |
| ICU users can programmatically build a UnicodeSet by adding or removing ranges |
| of characters or by using the retain (intersection), remove (difference), and |
| add (union) operations. |
| |
| ## Property Values |
| |
| The following property value variants are recognized: |
| |
| | Format | Description | Example | |
| |--------|-----------------------------------------------------------------------------------------------------|-----------------------------------| |
| | short | omits the type (used to prevent ambiguity and only allowed with the Category and Script properties) | Lu | |
| | medium | uses an abbreviated type and value | gc=Lu | |
| | long | uses a full type and value | General_Category=Uppercase_Letter | |
| |
| If the type or value is omitted, then the equals sign is also omitted. The short |
| style is only used for Category and Script properties because these properties |
| are very common and their omission is unambiguous. |
| |
| In actual practice, you can mix type names and values that are omitted, |
| abbreviated, or full. For example, for Category=Unassigned you could use any of |
| the forms shown in the table, or a mixed form such as \\p{gc=Unassigned}, |
| \\p{Category=Cn}, or \\p{Unassigned}. |
| |
| When these are processed, case and whitespace are ignored so you may use them |
| for clarity, if desired. For example, \\p{Category = Uppercase Letter} or |
| \\p{Category = uppercase letter}. |
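
The effect of ignoring case and whitespace can be sketched as a normalization step applied before names are compared. This is not ICU's matching code: `LooseNameSketch` and its methods are hypothetical names, and treating underscores as ignorable too is an assumption made here so that "Uppercase Letter" matches "Uppercase_Letter".

```java
import java.util.Locale;

class LooseNameSketch {
    /** Removes whitespace and underscores (an assumption) and lowercases the name. */
    static String normalize(String name) {
        return name.replaceAll("[\\s_]", "").toLowerCase(Locale.ROOT);
    }

    /** Returns true if two property names or values match after normalization. */
    static boolean sameName(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("Uppercase Letter", "uppercase_letter")); // true
        System.out.println(sameName("Lu", "Ll"));                             // false
    }
}
```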
| |
| For a list of supported properties, see the [Properties](properties.md) chapter. |
| |
| ## Getting UnicodeSet from Script |
| |
| ICU can build a UnicodeSet from a script. The following example generates a |
| pattern from all the scripts that are associated with a locale and then |
| constructs a UnicodeSet from the generated pattern. |
| |
| **In C++:** |
| |
| // Requires <stdio.h>, unicode/uscript.h, unicode/unistr.h, and unicode/uniset.h. |
| UErrorCode err = U_ZERO_ERROR; |
| const int32_t capacity = 10; |
| const int32_t strLength = 4; // script short names are four letters long |
| UChar32 c = 0x3096; // a Hiragana letter |
| UScriptCode script[10] = {USCRIPT_INVALID_CODE}; |
| // Fill script[] with the script codes associated with the "ja" locale. |
| int32_t num = uscript_getCode("ja", script, capacity, &err); |
| printf("Number of script codes associated: %d\n", num); |
| UnicodeString pattern = UnicodeString("[", 1, US_INV); |
| for (int32_t j = 0; j < num; j++) { |
|     const char * shortname = uscript_getShortName(script[j]); |
|     UnicodeString str(shortname, strLength, US_INV); |
|     // Adjacent property sets form a union, so no separator is needed. |
|     pattern.append("[:"); |
|     pattern.append(str); |
|     pattern.append(":]"); |
| } |
| pattern.append("]"); |
| UnicodeSet cnvSet(pattern, err); |
| if (U_SUCCESS(err)) { |
|     printf("%d\n", cnvSet.size()); |
|     printf("%d\n", cnvSet.contains(c)); |
| } |
| |
| **In Java:** |
| |
| import com.ibm.icu.lang.UScript; |
| import com.ibm.icu.text.UnicodeSet; |
| import com.ibm.icu.util.ULocale; |
| |
| ULocale ul = new ULocale("ja"); |
| int[] script = UScript.getCode(ul); |
| StringBuilder pattern = new StringBuilder("["); |
| for (int i = 0; i < script.length; i++) { |
|     // Adjacent property sets form a union, so no separator is needed. |
|     pattern.append("[:").append(UScript.getShortName(script[i])).append(":]"); |
| } |
| pattern.append("]"); |
| System.out.println(pattern); |
| UnicodeSet ucs = new UnicodeSet(pattern.toString()); |
| System.out.println(ucs.size()); |
| System.out.println(ucs.contains(0x3096)); // a Hiragana letter |