| --- |
| layout: default |
| title: Architecture |
| nav_order: 2 |
| parent: Collation |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Collation Service Architecture |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Overview |
| |
| This section describes the design principles, architecture and coding |
| conventions of the ICU Collation Service. |
| |
| ## Collator |
| |
| To use the Collation Service, a Collator must first be instantiated. An |
| Collator is a data structure or object that maintains all of the property |
| and state information necessary to define and support the specific collation |
| behavior provided. Examples of properties described in the Collator are the |
| locale, whether normalization is to be performed, and how many levels of |
| collation are to be evaluated. Examples of the state information described in |
| the Collator include the direction of a Collation Element Iterator (forward |
| or backward) and the status of the last API executed. |
| |
| The Collator is instantiated either by referencing a locale or by defining a |
| custom set of rules (a tailoring). |
| |
| The Collation Service uses the paradigm: |
| |
| 1. Open a Collator, |
| |
| 2. Use while necessary, |
| |
| 3. Close the Collator. |
| |
| Collator instances cannot be shared among threads. You should open them |
| instead, and use a different collator for each separate thread. The safe clone |
| function is supported for cloning collators in a thread-safe fashion. |
| |
| The Collation Service follows the ICU conventions for locale designation |
| when opening collators: |
| |
| 1. NULL means the default locale. |
| |
| 2. The empty locale name ("") means the root locale. |
| The Collation Service adheres to the ICU conventions described in the |
| "[ICU Architectural Design](../design.md) " section of the users guide. |
| In particular: |
| |
| 3. The standard error code convention is usually followed. (Functions that do |
| not take an error code parameter do so for backward compatibility.) |
| |
| 4. The string length convention is followed: when passing a `UChar *`, the |
| length is required in a separate argument. If -1 is passed for the length, |
| it is assumed that the string is zero terminated. |
| |
| ### Collation locale and keyword handling |
| |
| When a collator is created from a locale, the collation service (like all ICU |
| services) must map the requested locale to the localized collation data |
| available to ICU at the time. It does so using the standard ICU locale fallback |
| mechanism. See the fallback section of the [locale |
| chapter](../locale/index.md) for more details. |
| |
| If you pass a regular locale in, like "en_US", the collation service first |
| searches with fallback for "collations/default" key. The first such key it finds |
| will have an associated string value; this is the keyword name for the collation |
| that is default for this locale. If the search falls all the way back to the |
| root locale, the collation service will us the "collations/default" key there, |
| which has the value "standard". |
| |
| If there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the |
| collation service searches with fallback for "collations/phonebook". If the |
| search is successful, the collation service uses the string value it finds to |
| instantiate a Collator. If the search fails because no such key is present in |
| any of ICU's locale data (e.g., "de@collation=funky"), the service returns a |
| collator implementing the default tailoring of the locale. |
| If the fallback is all the way to the root locale, then |
| the return `UErrorCode` is `U_USING_DEFAULT_WARNING`. |
| |
| ## Input values for collation |
| |
| Collation deals with processing strings. ICU generally requires that all the |
| strings should be in UTF-16 format, and that all the required conversion should |
| done before ICU functions are used. In the case of collation, there are APIs |
| that can also take instances of character iterators (`UCharIterator`) |
| or UTF-8 directly. |
| |
| Theoretically, character iterators can iterate strings |
| in any encoding. ICU currently provides character iterator implementations for |
| UTF-8 and UTF-16BE (useful when processing data from a big endian platform on an |
| little endian machine). It should be noted, however, that using iterators for |
| collation APIs has a performance impact. It should be used in situations when it |
| is not desirable to convert whole strings before the operation - such as when |
| using a string compare function. |
| |
| ## Collation Elements |
| |
| As discussed in the introduction, there are many possible orderings for sorted |
| text, depending on language and other factors. Ideally, there is a way to |
| describe each ordering as a set of rules for calculating numeric values for each |
| string of text. The collation process then becomes one of simply comparing these |
| numeric values. |
| |
| This essentially describes the way the Collation Service works. To implement |
| a particular sort ordering, first the relationship between each character or |
| character sequence is derived. For example, a Spanish ordering defines the |
| letter sequence "CH" to be between the letters "C" and "D". As also discussed in |
| the introduction, to order strings properly requires that comparison of base |
| letters must be considered separately from comparison of accents. Letter case |
| must also be considered separately from either base letters or accents. Any |
| ordering specification language must provide a way to define the relationships |
| between characters or character sequences on multiple levels. ICU supports this |
| by using "<" to describe a relationship at the primary level, using "<<" to |
| describe a relationship at the secondary level, and using "<<<" to describe a |
| relationship at the tertiary level. Here are some example usages: |
| |
| Symbol | Example | Description |
| ------ | -------- | ----------- |
| `<` | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch" |
| `<<` | `a << ä` | Make a secondary (accent) difference between "a" and "ä" |
| `<<<` | `a<<<A` | Make a tertiary difference between "a" and "A" |
| |
| A more complete description of the ordering specification symbols and their |
| meanings is provided in the section on Collation Tailoring. |
| |
| Once a sort ordering is defined by specifying the desired relationships between |
| characters and character sequences, ICU can convert these relationships to a |
| series of numerical values (one for each level) that satisfy these same |
| relationships. |
| |
| This series of numeric values, representing the relative weighting of a |
| character or character sequence, is called a Collation Element (CE). |
| One possible encoding of a Collation Element is a 32-bit value consisting of |
| a 16-bit primary weight, a 8-bit secondary weight, |
| 2 case bits, and a 6-bit tertiary weight. |
| |
| The sort weight of a string is represented by the collation elements of its |
| component characters and character sequences. For example, the sort weight of |
| the string "apple" would consist of its component Collation Elements, as shown |
| here: |
| |
| "Apple" | "Apple" Collation Elements |
| ------- | -------------------------- |
| a | `[1900.05.05]` |
| p | `[3700.05.05]` |
| p | `[3700.05.05]` |
| l | `[2F00.05.05]` |
| e | `[2100.05.05]` |
| |
| In this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an |
| 8-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of |
| 05 (hex). |
| |
| String comparison is performed by comparing the collation elements of each |
| string. Each of the primary weights are compared. If a difference is found, that |
| difference determines the relationship between the two strings. If no |
| differences are found, the secondary weights are compared and so forth. |
| |
| With ICU it is possible to specify how many levels should be compared. For some |
| applications, it can be desirable to compare only primary levels or to compare |
| only primary and secondary levels. |
| |
| ## Sort Keys |
| |
| If a string is to be compared thousands or millions of times, |
| it can be more efficient to use sort keys. |
| Sort keys are useful in situations where a large amount of data is indexed |
| and frequently searched. The sort key is generated once and used in subsequent |
| comparisons, rather than repeatedly generating the string's Collation Elements. |
| The comparison of sort keys is a very efficient and simple binary compare of strings of |
| unsigned bytes. |
| |
| An important property of ICU sort keys is that you can obtain the same results |
| by comparing 2 strings as you do by comparing the sort keys of the 2 strings |
| (provided that the same ordering and related collation attributes are used). |
| |
| An ICU sort key is a pre-processed sequence of bytes generated from a Unicode |
| string. The weights for each comparison level are concatenated, separated by a |
| "0x01" byte between levels. |
| The entire sequence is terminated with a 0x00 byte for convenience in C APIs. |
| (This 0x00 terminator is counted in the sort key length — |
| unlike regular strings where the NUL terminator is excluded from the string length.) |
| |
| ICU actually compresses the sort keys so that they take the |
| minimum storage in memory and in databases. |
| |
| <!-- TODO: (diagram was missing in Google Sites already) |
| The diagram below represents an uncompressed sort key in ICU for ease of understanding. --> |
| |
| ### Sort key size |
| |
| One of the more important issues when considering using sort keys is the sort |
| key size. Unfortunately, it is very hard to give a fast exact answer to the |
| following question: "What is the maximum size for sort keys generated for |
| strings of size X". This problem is twofold: |
| |
| 1. The maximum size of the sort key depends on the size of the collation |
| elements that are used to build it. Size of collation elements vary greatly |
| and depends both on the alphabet in question and on the locale used. |
| |
| 2. Compression is used in building sort keys. Most 'regular' sequences of |
| characters produce very compact sort keys. |
| |
| If one is to assume the worst case and use too-big buffers, a lot of space will |
| be wasted. However, if you use too-small buffers, you will lose performance if |
| generated sort keys are longer than supplied buffers too often |
| (and you have to reallocate for each of those). |
| A good strategy |
| for this problem would be to manually manage a large buffer for storing sortkeys |
| and keep a list of indices to sort keys in this buffer (see the "large buffers" |
| [Collation Example](examples#using-large-buffers-to-manage-sort-keys) |
| for more details). |
| |
| Here are some rules of a thumb, please do not rely on them. If you are looking |
| at the East Asian locales, you probably want to go with 5 bytes per code point. |
| For Thai, 3 bytes per code point should be sufficient. For all the other locales |
| (mostly Latin and Cyrillic), you should be fine with 2 bytes per code point. |
| These values are based on average lengths of sort keys generated with tertiary |
| strength. If you need quaternary and identical strength (you should not), add 3 |
| bytes per code point to each of these. |
| |
| ### Partial sort keys |
| |
| In some cases, most notably when implementing [radix |
| sorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only |
| parts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing |
| parts of sort keys (`ucol_nextSortKeyPart` API). These sort keys may or may not be |
| compressed; that is, they may or may not be compatible with regular sort keys. |
| |
| ### Merging sort keys |
| |
| Sometimes, it is useful to be able to merge sort keys. One example is having |
| separate sort keys for first and last names. If you need to perform an operation |
| that requires a sort key generated on the whole name, instead of concatenating |
| strings and regenerating sort keys, you should merge the sort keys. The merging |
| is done by merging the corresponding levels while inserting a terminator between |
| merged parts. The reserved sort key byte value for the merge terminator is 0x02. |
| For more details see [UCA section 1.6, Merging Sort |
| Keys](http://www.unicode.org/reports/tr10/#Interleaved_Levels). |
| |
| * C API: unicode/ucol.h `ucol_mergeSortkeys()` |
| * Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)` |
| |
| CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is |
| intended to allow concatenating strings like firstName+\\uFFFE+lastName to yield |
| the same results as merging their individual sort keys. |
| This has been fully implemented in ICU since version 53. |
| |
| ### Generating bounds for a sort key (prefix matching) |
| |
| Having sort keys for strings allows for easy creation of bounds - sort keys that |
| are guaranteed to be smaller or larger than any sort key from a give range. For |
| example, if bounds are produced for a sortkey of string "smith", strings between |
| upper and lower bounds with one level would include "Smith", "SMITH", "sMiTh". |
| Two kinds of upper bounds can be generated - the first one will match only |
| strings of equal length, while the second one will match all the strings with |
| the same initial prefix. |
| |
| CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum |
| primary weight, so that for example the string "smith\\uFFFF" can be used as the |
| upper bound rather than modifying the sort key for "smith". |
| |
| ## Collation Element Iterator |
| |
| The collation element iterator is used for traversing Unicode string collation |
| elements one at a time. It can be used to implement language-sensitive text |
| search algorithms like Boyer-Moore. |
| |
| For most applications, the two API categories, compare and sort key, are |
| sufficient. Most people do not need to manipulate collation elements directly. |
| |
| Example: |
| |
| Consider iterating over "apple" and "äpple". Here are sequences of collation |
| elements: |
| |
| String 1 | String 1 Collation Elements |
| -------- | --------------------------- |
| a | `[1900.05.05]` |
| p | `[3700.05.05]` |
| p | `[3700.05.05]` |
| l | `[2F00.05.05]` |
| e | `[2100.05.05]` |
| |
| String 2 | String 2 Collation Elements |
| -------- | --------------------------- |
| a | `[1900.05.05]` |
| \\u0308 | `[0000.9D.05]` |
| p | `[3700.05.05]` |
| p | `[3700.05.05]` |
| l | `[2F00.05.05]` |
| e | `[2100.05.05]` |
| |
| The resulting CEs are typically masked according to the desired strength, and |
| zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength) |
| produces the results of NULL secondary and tertiary differences. The collator then |
| ignores the NULL differences and declares a match. For more details see the |
| paper "Efficient text searching in Java™: Finding the right string in any |
| language" by Laura Werner ( |
| <http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>). |
| |
| ## Collation Attributes |
| |
| The Collation Service has a number of attributes whose values can be changed |
| during run time. These attributes affect both the functionality and the |
| performance of the Collation Service. This section describes these |
| attributes and, where possible, their performance impact. Performance |
| indications are only approximate and timings may vary significantly depending on |
| the CPU, compiler, etc. |
| |
| Although string comparison by ICU and comparison of each string's sort key give |
| the same results, attribute settings can impact the execution time of each |
| method differently. To be precise in the discussion of performance, this section |
| refers to the API employed in the measurement. The `ucol_strcoll` function is the |
| API for string comparison. The `ucol_getSortKey` function is used to create sort |
| keys. |
| |
| > :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`, |
| > that can be used to set any attribute to its default value |
| > (which is inherited from the UCA and the tailoring). |
| |
| ### Attribute Types |
| |
| #### Strength level |
| |
| Collation strength, or the maximum collation level used for comparison, is set |
| by using the `UCOL_STRENGTH` attribute. Valid values are: |
| |
| 1. `UCOL_PRIMARY` |
| |
| 2. `UCOL_SECONDARY` |
| |
| 3. `UCOL_TERTIARY` (default) |
| |
| 4. `UCOL_QUATERNARY` |
| |
| 5. `UCOL_IDENTICAL` |
| |
| #### French collation |
| |
| The `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary |
| differences in reverse order. Valid values are: |
| |
| 1. `UCOL_OFF` (default): compares secondary differences in the order they appear |
| in the string. |
| |
| 2. `UCOL_ON`: causes secondary differences to be considered in reverse order, as |
| it is done in the French language. |
| |
| #### Normalization mode |
| |
| The `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`, |
| controls whether text normalization is performed on the input strings. Valid |
| values are: |
| |
| 1. `UCOL_OFF` (default): turns off normalization check |
| |
| 2. `UCOL_ON` : normalization is checked and the collator performs normalization |
| if it is needed. |
| |
| X | FCD | NFC | NFD |
| --------------------- | --- | --- | --- |
| A-ring | Y | Y | |
| Angstrom | Y | | |
| A + ring | Y | | Y |
| A + grave | Y | Y | |
| A-ring + grave | Y | | |
| A + cedilla + ring | Y | | Y |
| A + ring + cedilla | | | |
| A-ring + cedilla | | Y | |
| |
| With normalization mode turned on, the `ucol_strcoll` function slows down by 10%. |
| In addition, the time to generate a sort key also increases by about 25%. |
| |
| #### Alternate handling |
| |
| This attribute allows shifting of the variable characters (usually spaces and |
| punctuation, in the UCA also most symbols) from the primary to the quaternary |
| strength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For |
| details see [UCA: Variable |
| Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML: |
| Collation |
| Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings), |
| and [“Ignore Punctuation” Options](customization/ignorepunct.md). |
| |
| 1. `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as |
| all the other characters |
| |
| 2. `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at |
| the primary, secondary and tertiary levels and their primary strengths will |
| be shifted to the quaternary level. |
| |
| #### Case Ordering |
| |
| Some conventions require uppercase letters to sort before lowercase ones, while |
| others require the opposite. This attribute is controlled by the value of the |
| `UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary |
| weights along with other appearance characteristics (like circling of letters). |
| The case-first attribute allows for emphasizing of the case property of the |
| letters by reordering the tertiary weights with either upper-first, and/or |
| lowercase-first. This difference gets the most significant bit in the weight. |
| Valid values for this attribute are: |
| |
| 1. `UCOL_OFF` (default): leave tertiary weights unaffected |
| |
| 2. `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort |
| before uppercase |
| |
| 3. `UCOL_UPPER_FIRST` : causes uppercase letters to sort first |
| |
| The case-first attribute does not affect the performance substantially. |
| |
| #### Case level |
| |
| When this attribute is set, an additional level is formed between the secondary |
| and tertiary levels, known as the Case Level. The case level is used to |
| distinguish large and small Japanese Kana characters. Case level could also be |
| used in other situations. for example to distinguish certain Pinyin characters. |
| Case level is controlled by `UCOL_CASE_LEVEL` attribute. Valid values for this |
| attribute are |
| |
| 1. `UCOL_OFF` (default): no additional case level |
| |
| 2. `UCOL_ON` : adds a case level |
| |
| #### Hiragana Quaternary |
| |
| *This setting is deprecated and ignored in recent versions of ICU.* |
| |
| Hiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points |
| will sort before everything else on the quaternary level. If set to `UCOL_OFF` |
| Hiragana letters are treated the same as all the other code points. This setting |
| can be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute. |
| You probably won't need to use it. |
| |
| #### Variable Top |
| |
| Variable Top is a boundary which decides whether the code points will be treated |
| as variable (shifted to quaternary level in the **shifted** mode) or |
| non-ignorable. Special APIs are used for setting of variable top. It can |
| basically be set either to a codepoint or a primary strength value. |
| |
| ## Performance |
| |
| ICU collation is designed to be fast, small and customizable. Several techniques |
| are used to enhance the performance: |
| |
| 1. Providing optimized processing for Latin characters. |
| |
| 2. Comparing strings incrementally and stopping at the first significant |
| difference. |
| |
| 3. Tuning to eliminate unnecessary file access or memory allocation. |
| |
| 4. Providing efficient preflight functions that allows fast sort key size |
| generation. |
| |
| 5. Using a single, shared copy of UCA in memory for the read-only default sort |
| order. Only small tailoring tables are kept in memory for locale-specific |
| customization. |
| |
| 6. Compressing sort keys efficiently. |
| |
| 7. Making the sort order be data-driven. |
| |
| In general, the best performance from the Collation Service is expected by |
| doing the following: |
| |
| 1. After opening a collator, keep and reuse it until done. Do not open new |
| collators for the same sort order. (Note the restriction on |
| multi-threading.) |
| |
| 2. Use `ucol_strcoll` etc. when comparing strings. If it is necessary to |
| compare strings thousands or millions of times, |
| create the sort keys first and compare the sort keys instead. |
| Generating the sort keys of two strings is about 5-10 |
| times slower than just comparing them directly. |
| |
| 3. Follow the best practice guidelines for generating sort keys. Do not call |
| `ucol_getSortKey` twice to first size the key and then allocate the sort key |
| buffer and repeat the call to the function to fill in the buffer. |
| |
| ### Performance and Storage Implications of Attributes |
| |
| Most people use the default attributes when comparing strings or when creating |
| sort keys. When they do want to customize the ordering, the most common options |
| are the following : |
| |
| `UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\ |
| Used to ignore space and punctuation characters |
| |
| `UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\ |
| Used to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences. |
| |
| `UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\ |
| Used to change the ordering of upper vs. lower case letters (as |
| well as small vs. large kana) |
| |
| `UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\ |
| Used to ignore only the accent differences. |
| |
| `UCOL_NORMALIZATION_MODE == UCOL_ON`\ |
| Force to always check for normalization. This |
| is used if the input text may not be in FCD form. |
| |
| `UCOL_FRENCH_COLLATION == UCOL_OFF`\ |
| This is only useful for languages like French and Catalan that may turn this attribute on. |
| (It is the default only for Canadian French ("fr-CA").) |
| |
| In String Comparison, most of these options have little or no effect on |
| performance. The only noticeable one is normalization, which can cost 10%-40% in |
| performance. |
| |
| For Sort Keys, most of these options either leave the storage alone or reduce |
| it. Shifting can reduce the storage by about 10%-20%; case level + primary-only |
| can decrease it about 20% to 40%. Using no French accents can reduce the storage |
| by about 38% , but only for languages like French and Catalan that turn it on by |
| default. On the other hand, using Shifted + Quaternary can increase the storage by |
| 10%-15%. (The Identical Level also increases the length, but this option is not |
| recommended). |
| |
| > :point_right: **Note** All of the above numbers are based on |
| > tests run on a particular machine, with a particular set of data. |
| > (The data for each language is a large number of names |
| > in that language in the format <first_name>, <last name>.) |
| > The performance and storage may vary, depending on the particular computer, |
| > operating system, and data. |
| |
| ## Versioning |
| |
| Sort keys are often stored on disk for later reuse. A common example is the use |
| of keys to build indexes in databases. When comparing keys, it is important to |
| know that both keys were generated by the same algorithms and weightings. |
| Otherwise, identical strings with keys generated on two different dates, for |
| example, might compare as unequal. Sort keys can be affected by new versions of |
| ICU or its data tables, new sort key formats, or changes to the Collator. |
| Starting with release 1.8.1, ICU provides a versioning mechanism to identify the |
| version information of the following (but not limited to), |
| |
| 1. The run-time executable |
| |
| 2. The collation element content |
| |
| 3. The Unicode/UCA database |
| |
| 4. The tailoring table |
| |
| The version information of Collator is a 32-bit integer. If a new version of ICU |
| has changes affecting the content of collation elements, the version information |
| will be changed. In that case, to use the new version of ICU collator will |
| require regenerating any saved or stored sort keys. |
| |
| However, it is possible to modify ICU code or data without changing relevant version numbers, |
| so it is safer to regenerate sort keys any time after any part of ICU has been updated. |
| |
| Since ICU4C 1.8.1. |
| it is possible to build your program so that it uses more than one version of |
| ICU (only in C/C++, not in Java). Therefore, you could use the current version |
| for the features you need and use the older version for collation. |
| |
| ## Programming Examples |
| |
| See the [Collation Examples](examples.md) chapter for an example of how to |
| compare and create sort keys with the default locale in C, C++ and Java. |