| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Collation Concepts |
| |
| The previous section demonstrated many of the requirements imposed on string |
| comparison routines that try to correctly collate strings according to |
| conventions of more than a hundred different languages, written in many |
| different scripts. This section describes the principles and architecture behind |
| the ICU Collation Service. |
| |
| ## Sortkeys vs Comparison |
| |
| Sort keys are most useful in databases, where the overhead of calling a function |
| for each comparison is very large. |
| |
| Generating a sort key from a Collator is many times more expensive than doing a |
| compare with the Collator (for common use cases). That's if the two functions |
| are called from Java or C. So for those languages, unless there is a very large |
| number of comparisons, it is better to call the compare function. |
| |
| Here is an example, with a little back-of-the-envelope calculation. Let's |
| suppose that with a given language on a given platform, the compare performance |
| (CP) is 100 times faster than sortKey performance (SP), so that generating one |
| sort key costs as much as 100 compare calls, and that you are doing a |
| binary search of a list with 1,000 elements. The binary comparison performance |
| is BP. We'd do about 10 comparisons, getting: |
| |
| compare: 10 \* CP |
| |
| sortkey: 1 \* SP + 10 \* BP |
| |
| Even if BP is free, compare would be better. One has to get up to where log2(n) |
| = 100 before they break even. |
| |
| But even this calculation is only a rough guide. First, the binary comparison is |
| not completely free. Secondly, the performance of the compare function varies |
| radically with the source data. We optimized for maximizing performance of |
| collation in sorting and binary search, so comparing strings that are "close" is |
| optimized to be much faster than comparing strings that are "far away". That |
| optimization is important because normal sort/lookup operations compare close |
| strings far more often -- think of binary search, where the last few comparisons |
| are always with the closest strings. So even the above calculation is not very |
| accurate. |
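
The back-of-the-envelope calculation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the cost values `CP`, `SP` and `BP` are the assumed figures from the text (a sort key costing 100 compare calls, a free binary comparison), and the function names are made up for this illustration.

```python
import math

def search_steps(n: int) -> int:
    """Approximate number of comparisons in a binary search over n items."""
    return math.ceil(math.log2(n))

def compare_cost(n: int, cp: float) -> float:
    """Total cost when every step calls the collator's compare function."""
    return search_steps(n) * cp

def sort_key_cost(n: int, sp: float, bp: float) -> float:
    """Total cost when the probe string gets one sort key and every step
    is a binary comparison against stored sort keys."""
    return sp + search_steps(n) * bp

def break_even_steps(cp: float, sp: float, bp: float) -> float:
    """Number of comparisons at which both approaches cost the same."""
    return sp / (cp - bp)

CP = 1.0         # assumed unit cost of one compare call
SP = 100.0 * CP  # assumed: one sort key costs as much as 100 compare calls
BP = 0.0         # assumed: binary comparison of sort keys is free

print(compare_cost(1000, CP))        # 10.0
print(sort_key_cost(1000, SP, BP))   # 100.0
print(break_even_steps(CP, SP, BP))  # 100.0
```

With these assumed costs the two approaches only break even at 100 comparisons per lookup, which matches the log2(n) = 100 figure in the text.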
| |
| ## Comparison Levels |
| |
| In general, when comparing and sorting objects, some properties can take |
| precedence over others. For example, in geometry, you might consider first the |
| number of sides a shape has, followed by the number of sides of equal length. |
| This causes triangles to be sorted together, then rectangles, then pentagons, |
| etc. Within each category, the shapes would be ordered according to whether they |
| had 0, 2, 3 or more sides of the same length. However, this is not the only way |
| the shapes can be sorted. For example, it might be preferable to sort shapes by |
| color first, so that all red shapes are grouped together, then blue, etc. |
| Another approach would be to sort the shapes by the amount of area they enclose. |
| |
| Similarly, character strings have properties, some of which can take precedence |
| over others. There is more than one way to prioritize the properties. |
| |
| For example, a common approach is to distinguish characters first by their |
| unadorned base letter (for example, without accents, vowels or tone marks), then |
| by accents, and then by the case of the letter (upper vs. lower). Ideographic |
| characters might be sorted by their component radicals and then by the number of |
| strokes it takes to draw the character. |
| An alternative ordering would be to sort these characters by strokes first and |
| then by their radicals. |
| |
| The ICU Collation Service supports many levels of comparison (named "Levels", |
| but also known as "Strengths"). Having these categories enables ICU to sort |
| strings precisely according to local conventions. In addition, by allowing the |
| levels to be selectively employed, searching for a string in text can be |
| performed with various matching conditions. |
| |
| Performance optimizations have been made for ICU collation with the default |
| level settings. Performance-specific impacts are discussed in the Performance |
| section below. |
| |
| Following is a list of the names for each level and an example usage: |
| |
| 1. Primary Level: Typically, this is used to denote differences between base |
| characters (for example, "a" < "b"). It is the strongest difference. For |
| example, dictionaries are divided into different sections by base character. |
| This is also called the level-1 strength. |
| |
| 2. Secondary Level: Accents in the characters are considered secondary |
| differences (for example, "as" < "às" < "at"). Other differences between |
| letters can also be considered secondary differences, depending on the |
| language. A secondary difference is ignored when there is a primary |
| difference anywhere in the strings. This is also called the level-2 |
| strength. |
| Note: In some languages (such as Danish), certain accented letters are |
| considered to be separate base characters. In most languages, however, an |
| accented letter only has a secondary difference from the unaccented version |
| of that letter. |
| |
| 3. Tertiary Level: Upper and lower case differences in characters are |
| distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In |
| addition, a variant of a letter differs from the base form on the tertiary |
| level (such as "A" and "Ⓐ"). Another example is the difference between large |
| and small Kana. A tertiary difference is ignored when there is a primary or |
| secondary difference anywhere in the strings. This is also called the |
| level-3 strength. |
| |
| 4. Quaternary Level: When punctuation is ignored (see |
| [Ignoring Punctuation](#ignoring-punctuation)) at levels 1-3, an additional |
| level can be used to distinguish words with and without punctuation (for |
| example, "ab" < "a-b" < "aB"). This difference is ignored when there is a |
| primary, secondary or tertiary difference. This is also known as the level-4 |
| strength. The quaternary level should only be used if ignoring punctuation is |
| required or when processing Japanese text (see |
| [Hiragana Processing](#hiragana-processing)). |
| |
| 5. Identical Level: When all other levels are equal, the identical level is |
| used as a tiebreaker. The Unicode code point values of the NFD form of each |
| string are compared at this level, just in case there is no difference at |
| levels 1-4. For example, Hebrew cantillation marks are only distinguished |
| at this level. This level should be used sparingly, because strings that |
| differ only in their code point values are extremely rare. |
| Using this level substantially decreases the performance for |
| both incremental comparison and sort key generation (as well as increasing |
| the sort key length). It is also known as the level-5 strength. |
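
The way a weaker level only matters when all stronger levels are equal can be sketched with a toy model in Python. This is not the ICU implementation and does not use real collation weights: as a simplifying assumption, the base letter (lowercased) stands in for the primary weight, the combining marks from the NFD form stand in for the secondary weight, and an uppercase flag stands in for the tertiary weight. The names `levels` and `compare` are hypothetical.

```python
import unicodedata

def levels(s):
    """Split a string into toy (primary, secondary, tertiary) weight tuples."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += ch           # accent belongs to the previous base letter
        else:
            primary.append(ch.lower())    # base letter, case removed
            secondary.append("")          # no accent seen yet
            tertiary.append(ch.isupper()) # case is the only tertiary feature here
    return tuple(primary), tuple(secondary), tuple(tertiary)

def compare(a, b, strength=3):
    """Return -1, 0 or 1, looking only at the first `strength` levels.

    Whole levels are compared one after another, so a secondary difference
    is only consulted when the strings are equal on the entire primary level.
    """
    ka, kb = levels(a)[:strength], levels(b)[:strength]
    return (ka > kb) - (ka < kb)

print(compare("as", "às"), compare("às", "at"))   # -1 -1
print(compare("ao", "Ao"), compare("Ao", "aò"))   # -1 -1
print(compare("role", "rôle", strength=1))        # 0
```

Lowering `strength` simply drops the trailing levels from the comparison, which is why a primary-strength match treats "role", "Role" and "rôle" as equal.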
| |
| ## Backward Secondary Sorting |
| |
| Some languages require words to be ordered on the secondary level according to |
| the *last* accent difference, as opposed to the *first* accent difference. This |
| was previously the default for all French locales, based on some French |
| dictionary ordering traditions, but is currently only applicable to Canadian |
| French (locale **fr_CA**), for conformance with the [Canadian sorting |
| standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in |
| ordering is only noticeable for a small number of pairs of real words. For more |
| information see [UCA: Contextual |
| Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity). |
| |
| Example: |
| |
| Forward secondary | Backward secondary |
| ----------------- | ------------------ |
| cote | cote |
| coté | côte |
| côte | coté |
| côté | côté |
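
The two columns of the table can be reproduced with a small Python toy model. It is a sketch under simplifying assumptions, not ICU code: accents taken from the NFD form play the role of secondary weights, and backward secondary sorting is modeled by reversing the secondary weights before comparison. The function name `key` is hypothetical.

```python
import unicodedata

def key(word, backward_secondary=False):
    """Toy sort key made of a primary level and a secondary (accent) level."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += ch        # accent belongs to the previous base letter
        else:
            primary.append(ch.lower())
            secondary.append("")
    if backward_secondary:
        secondary.reverse()            # the last accent difference now decides first
    return tuple(primary), tuple(secondary)

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=key))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=lambda w: key(w, backward_secondary=True)))
# ['cote', 'côte', 'coté', 'côté']
```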
| |
| ## Contractions |
| |
| A contraction is a sequence consisting of two or more letters. It is considered |
| a single letter in sorting. |
| |
| For example, in the traditional Spanish sorting order, "ch" is considered a |
| single letter. All words that begin with "ch" sort after all other words |
| beginning with "c", but before words starting with "d". |
| |
| Other examples of contractions are "ch" in Czech, which sorts after "h", and |
| "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n" |
| respectively. |
| |
| Example: |
| |
| Order without contraction | Order with contraction "lj" sorting after letter "l" |
| ------------------------- | ---------------------------------------------------- |
| la | la |
| li | li |
| lj | lk |
| lja | lz |
| ljz | lj |
| lk | lja |
| lz | ljz |
| ma | ma |
| |
| Contracting sequences such as the above are not very common in most languages. |
| |
| > :point_right: **Note** Since ICU 2.2, and as required by the UCA, |
| > if a completely ignorable code point |
| > appears in text in the middle of a contraction, it will not break the contraction. |
| > For example, in Czech sorting, cU+0000h will sort as if it were ch. |
| |
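The "lj" table above can be reproduced with a toy model in Python. This is only a sketch of the idea, not how ICU stores contractions: a word is split into collation elements, with the contraction matched before single letters, and each element is looked up in a hypothetical alphabet list in which "lj" has its own position after "l".

```python
import re
import string

def elements(word, contractions=()):
    """Split a word into collation elements, preferring contractions."""
    if not contractions:
        return list(word)
    # Try longer contractions first so that the longest match wins.
    alternatives = sorted(contractions, key=len, reverse=True)
    pattern = "|".join(map(re.escape, alternatives)) + "|."
    return re.findall(pattern, word, flags=re.DOTALL)

def key(word, order, contractions=()):
    """Toy sort key: the position of every element in the given alphabet."""
    return [order.index(e) for e in elements(word, contractions)]

letters = list(string.ascii_lowercase)
with_lj = letters[:12] + ["lj"] + letters[12:]   # "lj" placed right after "l"

words = ["lz", "lja", "ma", "li", "lk", "lj", "la", "ljz"]
print(sorted(words, key=lambda w: key(w, letters)))
# ['la', 'li', 'lj', 'lja', 'ljz', 'lk', 'lz', 'ma']
print(sorted(words, key=lambda w: key(w, with_lj, ("lj",))))
# ['la', 'li', 'lk', 'lz', 'lj', 'lja', 'ljz', 'ma']
```
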
| ## Expansions |
| |
| If a letter sorts as if it were a sequence of more than one letter, it is called |
| an expansion. |
| |
| For example, in German phonebook sorting (de@collation=phonebook or BCP 47 |
| de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae". |
| All words starting with "ä" will sort between words starting with "ad" and words |
| starting with "af". |
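
As an illustration, the expansion of "ä" can be mimicked in Python with a lookup table. This is a toy model on the primary level only, not the ICU phonebook tailoring; the table `EXPANSIONS`, the function `primary_key` and the sample words are assumptions chosen for the example.

```python
import unicodedata

# Toy expansion table: "ä" (U+00E4) sorts like the two-letter sequence "ae".
EXPANSIONS = {"\u00e4": "ae"}

def primary_key(word):
    """Toy primary key: compose first, so that both encodings of "ä" are found."""
    composed = unicodedata.normalize("NFC", word).lower()
    return "".join(EXPANSIONS.get(ch, ch) for ch in composed)

words = ["affe", "\u00e4hnlich", "adler"]
print(sorted(words, key=primary_key))  # ['adler', 'ähnlich', 'affe']
```

The word starting with "ä" lands between the words starting with "ad" and "af", as described above.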
| |
| In the case of Unicode encoding, characters can often be represented either as |
| pre-composed characters or in decomposed form. For example, the letter "à" can |
| be represented in its decomposed (a+\`) and pre-composed (à) form. Most |
| applications do not want to distinguish text by the way it is encoded. A search |
| for "à" should find all instances of the letter, regardless of whether the |
| instance is in pre-composed or decomposed form. Therefore, either form of the |
| letter must result in the same sort ordering. The architecture of the ICU |
| Collation Service supports this. |
| |
| ## Contractions Producing Expansions |
| |
| It is possible to have contractions that produce expansions. |
| |
| One example occurs in Japanese, where the vowel with a prolonged sound mark is |
| treated as equivalent to the long vowel version: |
| |
| カアー<<< カイー and\ |
| キイー<<< キイー |
| |
| > :point_right: **Note** Since ICU 2.0, the Japanese tailoring uses |
| > [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings) |
| > instead of contractions producing expansions. |
| |
| ## Normalization |
| |
| In the section on expansions, we discussed that text in Unicode can often be |
| represented in either pre-composed or decomposed forms. There are other types of |
| equivalences possible with Unicode, including Canonical and Compatibility. The |
| process of |
| Normalization ensures that text is written in a predictable way so that searches |
| are not made unnecessarily complicated by having to match on equivalences. Not |
| all text is normalized, however, so it is useful to have a collation service |
| that can address text that is not normalized, but do so with efficiency. |
| |
| The ICU Collation Service handles un-normalized text properly, producing the |
| same results as if the text were normalized. |
| |
| In practice, most data that is encountered is in normalized or semi-normalized |
| form already. The ICU Collation Service is designed so that it can process a |
| wide range of normalized or un-normalized text without a need for normalization |
| processing. When a case is encountered that requires normalization, the ICU |
| Collation Service drops into code specific to this purpose. This maximizes |
| performance for the majority of text that does not require normalization. |
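
The idea of checking first and normalizing only when needed can be sketched with Python's standard `unicodedata` module. This is an analogy to the behavior described above, not ICU code; `canonical_form` is a hypothetical helper, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def canonical_form(text):
    """Return the NFD form, skipping the conversion when it is not needed."""
    if unicodedata.is_normalized("NFD", text):
        return text                            # fast path: nothing to do
    return unicodedata.normalize("NFD", text)  # slow path: decompose and reorder

precomposed = "\u00e0"   # "à" as a single code point
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT

print(precomposed == decomposed)                                  # False
print(canonical_form(precomposed) == canonical_form(decomposed))  # True
```

Plain ASCII text and already decomposed text take the fast path, so the cost of normalization is only paid for the strings that need it.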
| |
| In addition, if the text is known with certainty not to contain un-normalized |
| text, then even the overhead of checking for normalization can be eliminated. |
| The ICU Collation Service has the ability to turn Normalization Checking either |
| on or off. If Normalization Checking is turned off, it is the user's |
| responsibility to ensure that all text is already in the appropriate form. This |
| is true in a great majority of the world's languages, so normalization checking is |
| turned off by default for most locales. |
| |
| If the text requires normalization processing, Normalization Checking should be |
| on. Any language that uses multiple combining characters such as Arabic, ancient |
| Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking |
| to be on, or the text to go through a normalization process before collation. |
| |
| For more information about Normalization-related reordering please see |
| [Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and |
| [UAX #15](http://www.unicode.org/reports/tr15/). |
| |
| > :point_right: **Note** ICU supports two modes of normalization: on and off. |
| > The java.text.\* classes offer a compatibility decomposition mode, which is not supported in ICU. |
| |
| ## Ignoring Punctuation |
| |
| In some cases, punctuation can be ignored while searching or sorting data. For |
| example, this enables a search for "biweekly" to also return instances of |
| "bi-weekly". In other cases, it is desirable for punctuated text to be |
| distinguished from text without punctuation, but to have the text sort close |
| together. |
| |
| These two behaviors can be accomplished if there is a way for a character to be |
| ignored on all levels except for the quaternary level. If this is the case, then |
| two strings which compare as identical on the first three levels (base letter, |
| accents, and case) are then distinguished at the fourth level based on their |
| punctuation (if any). If the comparison function ignores differences at the |
| fourth level, then strings that differ by punctuation only are compared as |
| equal. |
| |
| The following table shows the results of sorting a list of terms in 3 different |
| ways. In the first column, punctuation characters (space " ", and hyphen "-") |
| are not ignored (" " < "-" < "b"). In the second column, punctuation characters |
| are ignored in the first 3 levels and compared only in the fourth level. In the |
| third column, punctuation characters are ignored in the first 3 levels and the |
| fourth level is not considered. In the last column, punctuated terms are |
| equivalent to the identical terms without punctuation. |
| |
| For more options and details see the [“Ignore Punctuation” |
| Options](customization/ignorepunct.md) page. |
| |
| Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength |
| ------------- | --------------------------------- | ------------------------------- |
| black bird | black bird | **black bird** |
| black Bird | black-bird | **black-bird** |
| black birds | blackbird | **blackbird** |
| black-bird | black Bird | black Bird |
| black-Bird | black-Bird | black-Bird |
| black-birds | blackBird | blackBird |
| blackbird | black birds | black birds |
| blackBird | black-birds | black-birds |
| blackbirds | blackbirds | blackbirds |
| |
| > :point_right: **Note** The strings with the same font format in the last column are |
| > compared as equal by the ICU Collator.\ |
| > Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that |
| > follow shifted code points will be completely ignored. This means that an accent |
| > following a space will compare as if it was a space alone. |
| |
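The three columns of the table can be reproduced with a toy model in Python. It is a sketch under simplifying assumptions and does not use ICU or real UCA weights: space and hyphen get small made-up weights, letters get larger ones, case is the only tertiary feature, and in the "shifted" keys the punctuation is dropped from the first levels and only contributes to a fourth level. All names are hypothetical.

```python
VARIABLE = {" ": 1, "-": 2}  # toy weights for variable characters: space < hyphen
HIGH = 0xFFFF                # toy fourth-level weight for all other characters

def letter_weight(ch):
    """Toy primary weight for a letter; always above the variable weights."""
    return 100 + ord(ch.lower())

def non_ignorable_key(s):
    """Punctuation counts like any other character on the primary level."""
    primary = [VARIABLE.get(ch) or letter_weight(ch) for ch in s]
    tertiary = [ch.isupper() for ch in s]
    return primary, tertiary

def shifted_key(s, quaternary=True):
    """Punctuation is ignored on the first levels and kept for the fourth."""
    letters = [ch for ch in s if ch not in VARIABLE]
    primary = [letter_weight(ch) for ch in letters]
    tertiary = [ch.isupper() for ch in letters]
    if not quaternary:
        return primary, tertiary
    fourth = [VARIABLE.get(ch, HIGH) for ch in s]
    return primary, tertiary, fourth

TERMS = ["blackbirds", "black-bird", "black Bird", "blackBird", "black birds",
         "blackbird", "black-Bird", "black bird", "black-birds"]

print(sorted(TERMS, key=non_ignorable_key))  # first column
print(sorted(TERMS, key=shifted_key))        # second column
# Third column: terms that differ only in punctuation get identical keys.
print(shifted_key("black bird", False) == shifted_key("blackbird", False))  # True
```
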
| ## Case Ordering |
| |
| The tertiary level is used to distinguish text by case, by small versus large |
| Kana, and other letter variants as noted above. |
| |
| Some applications prefer to emphasize case differences so that words starting |
| with the same case sort together. Some Japanese applications require the |
| difference between small and large Kana be emphasized over other tertiary |
| differences. |
| |
| The UCA does not provide means to separate out either case or Kana differences |
| from the remaining tertiary differences. However, the ICU Collation Service has |
| two options that help customize case and/or Kana differences. Both options |
| are turned off by default. |
| |
| ### CaseFirst |
| |
| The Case-first option makes case the most significant part of the tertiary |
| level. Primary and secondary levels are unaffected. With this option, words |
| starting with the same case sort together. The Case-first option can be set to |
| make either lowercase sort before |
| uppercase or uppercase sort before lowercase. |
| |
| Note: The case-first option does not constitute a separate level; it is simply a |
| reordering of the tertiary level. |
| |
| ICU makes use of the following three case categories for sorting: |
| |
| 1. uppercase: "ABC" |
| |
| 2. mixed case: "Abc", "aBc" |
| |
| 3. normal (lowercase or no case): "abc", "123" |
| |
| Mixed case is always sorted between uppercase and normal case when the |
| "case-first" option is set. |
| |
| ### CaseLevel |
| |
| The Case Level option makes a separate level for case differences. This is an |
| extra level positioned between secondary and tertiary. The case level is used in |
| Japanese to make the difference between small and large Kana more important than |
| the other tertiary differences. It also can be used to ignore other tertiary |
| differences, or even secondary differences. This is especially useful in |
| matching. For example, if the strength is set to primary only (level-1) and the |
| case level is turned on, the comparison ignores accents and tertiary differences |
| except for case. The contents of the case level are affected by the case-first |
| option. |
| |
| The case level is independent from the strength of comparison. It is possible to |
| have a collator set to primary strength with the case level turned on. This |
| provides for comparison that takes into account the case differences, while at |
| the same time ignoring accents and tertiary differences other than case. This |
| may be used in searching. |
| |
| Example: |
| |
| **Case-first off, Case level off** |
| |
| apple\ |
| ⓐⓟⓟⓛⓔ\ |
| Abernathy\ |
| ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\ |
| ähnlich\ |
| Ähnlichkeit |
| |
| **Lowercase-first, Case level off** |
| |
| apple\ |
| ⓐⓟⓟⓛⓔ\ |
| ähnlich\ |
| Abernathy\ |
| ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\ |
| Ähnlichkeit |
| |
| **Uppercase-first, Case level off** |
| |
| Abernathy\ |
| ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\ |
| Ähnlichkeit\ |
| apple\ |
| ⓐⓟⓟⓛⓔ\ |
| ähnlich |
| |
| **Lowercase-first, Case level on** |
| |
| apple\ |
| Abernathy\ |
| ⓐⓟⓟⓛⓔ\ |
| ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\ |
| ähnlich\ |
| Ähnlichkeit |
| |
| **Uppercase-first, Case level on** |
| |
| Abernathy\ |
| apple\ |
| ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\ |
| ⓐⓟⓟⓛⓔ\ |
| Ähnlichkeit\ |
| ähnlich |
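
How the case level and the case-first option interact with the strength can be sketched with a toy key function in Python. This is not the ICU implementation and it does not model letter variants such as the circled letters in the lists above: as an assumption, accents are the only secondary feature and case is the only tertiary feature. The function `key` and its parameters are hypothetical.

```python
import unicodedata

def key(s, strength=3, case_level=False, upper_first=False):
    """Toy sort key with an optional case level and uppercase-first ordering."""
    primary, secondary, case = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += ch          # accent belongs to the previous base letter
        else:
            primary.append(ch.lower())
            secondary.append("")
            # False sorts before True, so the flag is flipped for uppercase-first.
            case.append(ch.isupper() != upper_first)
    levels = [tuple(primary)]
    if strength >= 2:
        levels.append(tuple(secondary))
    if case_level or strength >= 3:
        levels.append(tuple(case))       # case follows the accents, if any
    return tuple(levels)

# Primary strength without the case level: accents and case are both ignored.
print(key("role", 1) == key("Role", 1) == key("rôle", 1))    # True
# Primary strength with the case level: accents ignored, case respected.
print(key("role", 1, True) == key("rôle", 1, True))          # True
print(key("role", 1, True) < key("Role", 1, True))           # True

names = ["denmark", "China", "Denmark", "china"]
print(sorted(names, key=key))
# ['china', 'China', 'denmark', 'Denmark']
print(sorted(names, key=lambda s: key(s, upper_first=True)))
# ['China', 'china', 'Denmark', 'denmark']
```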
| |
| ## Script Reordering |
| |
| Script reordering allows scripts and some other groups of characters to be moved |
| relative to each other. This reordering is done on top of the DUCET/CLDR |
| standard collation order. Reordering can specify groups to be placed at the |
| start and/or the end of the collation order. |
| |
| By default, reordering codes specified for the start of the order are placed in |
| the order given after several special non-script blocks. These special groups of |
| characters are space, punctuation, symbol, currency, and digit. Script groups |
| can be intermingled with these special non-script groups if those special groups |
| are explicitly specified in the reordering. |
| |
| The special code `others` stands for any script that is not explicitly mentioned |
| in the list. Anything that is after others will go at the very end of the list |
| in the order given. For example, `[Grek, others, Latn]` will result in an |
| ordering that puts all scripts other than Greek and Latin between them. |
| |
| ### Examples: |
| |
| Note: All examples below use the string equivalents for the scripts and reorder |
| codes that would be used in collator rules. The script and reorder code |
| constants that would be used in API calls will be different. |
| |
| **Example 1:**\ |
| set reorder code - `[Grek]`\ |
| result - `[space, punctuation, symbol, currency, digit, Grek, others]` |
| |
| **Example 2:**\ |
| set reorder code - `[Grek]`\ |
| result - `[space, punctuation, symbol, currency, digit, Grek, others]` |
| |
| followed by: set reorder code - `[Hani]`\ |
| result - `[space, punctuation, symbol, currency, digit, Hani, others]` |
| |
| That is, setting a reordering always modifies |
| the DUCET/CLDR order, replacing whatever was previously set, rather than adding |
| on to it. In order to cumulatively modify an ordering, you have to retrieve the |
| existing ordering, modify it, and then set it. |
| |
| **Example 3:**\ |
| set reorder code - `[others, digit]`\ |
| result - `[space, punctuation, symbol, currency, others, digit]` |
| |
| **Example 4:**\ |
| set reorder code - `[space, Grek, punctuation]`\ |
| result - `[symbol, currency, digit, space, Grek, punctuation, others]` |
| |
| **Example 5:**\ |
| set reorder code - `[Grek, others, Hani]`\ |
| result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]` |
| |
| **Example 6:**\ |
| set reorder code - `[Grek, others, Hani, symbol, Tglg]`\ |
| result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]` |
| |
| followed by:\ |
| set reorder code - `[NONE]`\ |
| result - DUCET/CLDR |
| |
| **Example 7:**\ |
| set reorder code - `[Grek, others, Hani, symbol, Tglg]`\ |
| result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]` |
| |
| followed by:\ |
| set reorder code - `[DEFAULT]`\ |
| result - original reordering for the locale which may or may not be DUCET/CLDR |
| |
| **Example 8:**\ |
| set reorder code - `[Grek, others, Hani, symbol, Tglg]`\ |
| result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]` |
| |
| followed by:\ |
| set reorder code - `[]`\ |
| result - original reordering for the locale which may or may not be DUCET/CLDR |
| |
| **Example 9:**\ |
| set reorder code - `[Hebr, Phnx]`\ |
| result - error |
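
The way the result lists in Examples 1 and 3 through 6 follow from the reorder codes can be sketched in Python. This is a toy model of the rules described above, not the ICU API: it only handles plain lists of code names and does not model `NONE`, `DEFAULT`, the empty list, or the error case of Example 9. The function name `effective_order` is hypothetical.

```python
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]

def effective_order(reorder_codes):
    """Return the resulting group order for a list of reorder codes."""
    codes = list(reorder_codes)
    if "others" not in codes:
        codes.append("others")   # unlisted scripts follow the listed ones
    # Special groups that are not mentioned keep their place at the start.
    leading = [g for g in SPECIAL_GROUPS if g not in codes]
    return leading + codes

print(effective_order(["Grek"]))
# ['space', 'punctuation', 'symbol', 'currency', 'digit', 'Grek', 'others']
print(effective_order(["space", "Grek", "punctuation"]))
# ['symbol', 'currency', 'digit', 'space', 'Grek', 'punctuation', 'others']
```

Each call starts from the default order again, which mirrors the statement above that setting a reordering replaces whatever was set previously.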
| |
| Beginning with ICU 55, scripts only reorder together if they are primary-equal, |
| for example Hiragana and Katakana. |
| |
| ICU 4.8-54: |
| |
| * Scripts were reordered in groups, each normally starting with a [Recommended |
| Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). |
| * Reorder codes moved as a group (were “equivalent”) if their scripts shared a |
| primary-weight lead byte. |
| * For example, Hebr and Phnx were “equivalent” reordering codes and were |
| reordered together. Their order relative to each other could not be changed. |
| * Only one code from any group could be reordered, not several codes from the |
| same group. |
| |
| ## Sorting of Japanese Text (JIS X 4061) |
| |
| Japanese standard JIS X 4061 requires two changes to the collation procedures: |
| special processing of Hiragana characters and (for performance reasons) prefix |
| analysis of text. |
| |
| ### Hiragana Processing |
| |
| JIS X 4061 standard requires more levels than provided by the UCA. To offer |
| conformant sorting order, ICU uses the quaternary level to distinguish between |
| Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana |
| symbols on quaternary level, thus causing Hiragana sequences to sort before |
| corresponding Katakana sequences. |
| |
| ### Prefix Analysis |
| |
| Another characteristic of sorting according to JIS X 4061 is a large number |
| of contractions followed by expansions (see |
| [Contractions Producing Expansions](#contractions-producing-expansions)). |
| This causes all the Hiragana and Katakana codepoints to be treated as |
| contractions, which reduces performance. The solution we adopted introduces the |
| prefix concept which allows us to improve the performance of Japanese sorting. |
| More about this can be found in the [customization |
| chapter](customization/index.md). |
| |
| ## Thai/Lao reordering |
| |
| UCA requires that certain Thai and Lao prevowels be reordered with a code point |
| following them. This option is always on in the ICU implementation, as |
| prescribed by the UCA. |
| |
| This rule takes effect when: |
| |
| 1. A Thai vowel of the range U+0E40-U+0E44 precedes a Thai consonant of the |
| range U+0E01-U+0E2E, or |
| |
| 2. A Lao vowel of the range U+0EC0-U+0EC4 precedes a Lao consonant of the |
| range U+0E81-U+0EAE. |
| |
| In these cases the vowel is placed after the consonant for collation purposes. |
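
The rule above can be sketched as a preprocessing step in Python. This only illustrates the swap for the code point ranges given in the text; it is not how ICU implements the reordering, and the function names are hypothetical.

```python
def is_prevowel_pair(vowel, consonant):
    """True if `vowel` is a Thai or Lao prevowel followed by a consonant of
    the same script, using the code point ranges given in the text."""
    thai = "\u0E40" <= vowel <= "\u0E44" and "\u0E01" <= consonant <= "\u0E2E"
    lao = "\u0EC0" <= vowel <= "\u0EC4" and "\u0E81" <= consonant <= "\u0EAE"
    return thai or lao

def reorder_prevowels(text):
    """Return a copy of `text` with each prevowel moved after its consonant."""
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        if is_prevowel_pair(chars[i], chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2   # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)

# THAI CHARACTER SARA E (U+0E40) followed by THAI CHARACTER KO KAI (U+0E01)
print(reorder_prevowels("\u0E40\u0E01") == "\u0E01\u0E40")  # True
```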
| |
| > :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai |
| > reordering. Java.text.\* classes allow tailorings to turn off reordering by |
| > using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai |
| > prevowels. |
| |
| ## Space Padding |
| |
| In many database products, fields are padded with null. To get correct results, |
| the input to a Collator should omit any superfluous trailing padding spaces. The |
| problem arises with contractions, expansions, or normalization. Suppose that |
| there are two fields, one containing "aed" and the other with "äd". German |
| phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will |
| compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" < |
| "aed". But if both fields are padded with spaces to a length of 4, then this |
| will reverse the order, since "äd" will compare as if it were one character |
| longer. In other words, when you start with strings 1 and 2 |
| |
| 1 | a | e | d | \<space\> |
| -- | -- | -- | --------- | --------- |
| 2 | ä | d | \<space\> | \<space\> |
| |
| they end up being compared on a primary level as if they were 1' and 2' |
| |
| 1' | a | e | d | \<space\> | |
| -- | -- | -- | -- | --------- | --------- |
| 2' | a | e | d | \<space\> | \<space\> |
| |
| Since 2' has an extra character (the extra space), it counts as having a primary |
| difference when it shouldn't. The correct result occurs when the trailing |
| padding spaces are removed, as in 1" and 2" |
| |
| 1" | a | e | d |
| -- | -- | -- | -- |
| 2" | a | e | d |
| |
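The effect of the padding can be sketched with a toy primary key in Python. This is an illustration of the tables above under simplifying assumptions, not ICU behavior: the hypothetical `primary_key` only expands "ä" to "ae" and lets every character, including a padding space, count as one primary weight.

```python
def primary_key(field):
    """Toy primary key: expand "ä" (U+00E4) to "ae"; every remaining
    character, including a padding space, contributes one weight."""
    return field.replace("\u00e4", "ae")

padded_1 = "aed "        # "aed" padded to a length of 4
padded_2 = "\u00e4d  "   # "äd" padded to a length of 4

# With padding, the second key is one space longer: a spurious difference.
print(primary_key(padded_1) == primary_key(padded_2))   # False

# After removing the trailing padding, both keys are identical.
trimmed_1, trimmed_2 = padded_1.rstrip(" "), padded_2.rstrip(" ")
print(primary_key(trimmed_1) == primary_key(trimmed_2)) # True
```
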
| ## Collator naming scheme |
| |
| ***Starting with ICU 54, the following naming scheme and its API functions are |
| deprecated.*** Use ucol_open() with language tag collation keywords instead (see |
| [Collation API Details](api.md)). For example, |
| ucol_open("de-u-co-phonebk-ka-shifted", &errorCode) for German Phonebook order |
| with "ignore punctuation" mode. |
| |
| When collating or matching text, a number of attributes can be used to affect |
| the desired result. The following describes the attributes, their values, their |
| effects, their normal usage, and the string comparison performance and sort key |
| length implications. It also includes single-letter abbreviations for both the |
| attributes and their values. These abbreviations allow a 'short-form' |
| specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which |
| can be used to specify that the desired options are: UCA version 4.0.0; ignore |
| spaces, punctuation and symbols; use Swedish linguistic conventions; compare |
| case-insensitively. |
| |
| A number of attribute values are common across different attributes; these |
| include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless |
| otherwise stated, the examples use the UCA alone with default settings. |
| |
| > :point_right: **Note** In order to achieve uniqueness, a collator name always |
| > has the attribute abbreviations sorted. |
| |
| ### Main References |
| |
| 1. For a full list of supported locales in ICU, see [Locale |
| Explorer](http://demo.icu-project.org/icu-bin/locexp), which also contains |
| an on-line demo showing sorting for each locale. The demo allows you to try |
| different attribute values, to see how they affect sorting. |
| |
| 2. To see tabular results for the UCA table itself, see the [Unicode Collation |
| Charts](http://www.unicode.org/charts/collation/). |
| |
| 3. For the UCA specification, see [UTS #10: Unicode Collation |
| Algorithm](http://www.unicode.org/reports/tr10/). |
| |
| 4. For more detail on the precise effects of these options, see [Collation |
| Customization](customization/index.md). |
| |
| #### Collator Naming Attributes |
| |
| Attribute | Abbreviation | Possible Values |
| ---------------------- | ------------ | --------------- |
| Locale | L | \<language\> |
| Script | Z | \<script\> |
| Region | R | \<region\> |
| Variant | V | \<variant\> |
| Keyword | K | \<keyword\> |
| | | |
| Strength | S | 1, 2, 3, 4, I, D |
| Case_Level | E | X, O, D |
| Case_First | C | X, L, U, D |
| Alternate | A | N, S, D |
| Variable_Top | T | \<hex digits\> |
| Normalization Checking | N | X, O, D |
| French | F | X, O, D |
| Hiragana | H | X, O, D |
| |
| #### Collator Naming Attribute Descriptions |
| |
| The **Locale** attribute is typically the most |
| important attribute for correct sorting and matching, according to the user |
| expectations in different countries and regions. The default UCA ordering will |
| only sort a few languages such as Dutch and Portuguese correctly ("correctly" |
| meaning according to the normal expectations for users of the languages). |
| Otherwise, you need to supply the locale to UCA in order to properly collate |
| text for a given language. Thus a locale needs to be supplied so as to choose a |
| collator that is correctly **tailored** for that locale. The choice of a locale |
| will automatically preset the values for all of the attributes to something that |
| is reasonable for that locale. Thus most of the time the other attributes do not |
| need to be explicitly set. In some cases, the choice of locale will make a |
| difference in string comparison performance and/or sort key length. |
| |
| In short attribute names, |
| `<language>_<script>_<region>_<variant>@collation=<keyword>` is |
| represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not |
| all the elements are required. Valid values for locale elements are general |
| valid values for RFC 3066 locale naming. |
| |
| **Example:**\ |
| **Locale="sv" (Swedish)** "Kypper" < "Köpfe"\ |
| **Locale="de" (German)** "Köpfe" < "Kypper" |
| |
| The **Strength** attribute determines whether accents or |
| case are taken into account when collating or matching text. (In writing |
| systems without case or accents, it controls similarly important features.) The |
| default strength setting usually does not need to be changed for collating |
| (sorting), but often needs to be changed when **matching** (e.g. SELECT). The |
| possible values include Default (D), Primary (1), Secondary (2), Tertiary (3), |
| Quaternary (4), and Identical (I). |
| |
| For example, people may choose to ignore accents or ignore accents and case when |
| searching for text. |
| |
| Almost all characters are distinguished by the first three levels, and in most |
| locales the default value is thus Tertiary. However, if Alternate is set to be |
| Shifted, then the Quaternary strength (4) can be used to break ties among |
| whitespace, punctuation, and symbols that would otherwise be ignored. If very |
| fine distinctions among characters are required, then the Identical strength (I) |
| can be used (for example, Identical Strength distinguishes between the |
| **Mathematical Bold Small A** and the **Mathematical Italic Small A**; for more |
| examples, look at the cells with white backgrounds in the collation charts). |
| However, using a level higher than Tertiary (in particular the Identical |
| strength) results in significantly longer sort keys, and slower string |
| comparison performance for equal strings. |
| |
| **Example:**\ |
| **S=1** role = Role = rôle\ |
| **S=2** role = Role < rôle\ |
| **S=3** role < Role < rôle |
| |
| The **Case_Level** attribute is used when ignoring accents |
| **but not** case. In such a situation, set Strength to be Primary, and |
| Case_Level to be On. In most locales, this setting is Off by default. There is a |
| small string comparison performance and sort key impact if this attribute is set |
| to be On. |
| |
| **Example:**\ |
| **S=1, E=X** role = Role = rôle\ |
| **S=1, E=O** role = rôle < Role |
| |
| The **Case_First** attribute is used to control whether |
| uppercase letters come before lowercase letters or vice versa, in the absence of |
| other differences in the strings. The possible values are Uppercase_First (U) |
| and Lowercase_First (L), plus the standard Default and Off. There is almost no |
| difference between the Off and Lowercase_First options in terms of results, so |
| typically users will not use Lowercase_First: only Off or Uppercase_First. |
| (People interested in the detailed differences between X and L should consult |
| [Collation Customization](customization/index.md).) |
| Specifying either L or U won't affect string comparison performance, but will |
| affect the sort key length. |
| |
| **Example:**\ |
| **C=X or C=L** "china" < "China" < "denmark" < "Denmark"\ |
| **C=U** "China" < "china" < "Denmark" < "denmark" |
| |
| The **Alternate** attribute is used to control the handling of |
| the so-called **variable** characters in the UCA: whitespace, punctuation and |
| symbols. If Alternate is set to Non-Ignorable (N), then differences among these |
| characters are of the same importance as differences among letters. If Alternate |
| is set to Shifted (S), then these characters are of only minor importance. The |
| Shifted value is often used in combination with Strength set to Quaternary. In |
| such a case, whitespace, punctuation, and symbols are considered when comparing |
| strings, but only if all other aspects of the strings (base letters, accents, |
| and case) are identical. If Alternate is not set to Shifted, then there is no |
| difference between a Strength of 3 and a Strength of 4. |
| |
| For more information and examples, see |
| [Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in |
| the UCA. |
| |
| The reason the Alternate values are not simply On and Off is that |
| additional Alternate values may be added in the future. |
| |
| The UCA option |
| **Blanked** is expressed with Strength set to 3, and Alternate set to Shifted. |
| |
| The default for most locales is Non-Ignorable. If Shifted is selected, string |
| comparison may be slower if there are many strings that are the same except for |
| punctuation; sort key length will not be affected unless the strength level is |
| also increased. |
| |
| **Example:**\ |
| **S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\ |
| **S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\ |
| **S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA |
| |
| The **Variable_Top** attribute is only meaningful if the |
| Alternate attribute is not set to Non-Ignorable. In such a case, it controls |
| which characters count as ignorable. The \<hex\> value specifies the "highest" |
| character sequence (in UCA order) whose weight is to be considered ignorable. |
| |
| Thus, for example, if a user wanted whitespace to be ignorable, but not any |
| visible characters, then they would use the value Variable_Top=0020 (space). The |
| hex digits should specify only a single character. All characters of the same |
| primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the |
| same effect as Variable_Top=0020. |
| |
| This setting (alone) has little impact on string comparison performance; setting |
| it lower or higher will make sort keys slightly shorter or longer respectively. |
| |
| **Example:**\ |
| **S=3, A=S** di Silva = diSilva < U.S.A. = USA\ |
| **S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA |
| |
| The **Normalization** setting determines whether |
| text is thoroughly normalized or not in comparison. Even if the setting is off |
| (which is the default for many locales), text as represented in common usage |
| will compare correctly (for details, see [UTN |
| #5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in |
| non-canonical order will there be a problem. If the setting is On, then the best |
| results are guaranteed for all possible text input. There is a medium string |
| comparison performance cost if this attribute is On, depending on the frequency |
| of sequences that require normalization. There is no significant effect on sort |
| key length. If the input text is known to be in NFD or NFKD normalization forms, |
| there is no need to enable this Normalization option. |
| |
| **Example:**\ |
| **N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\ |
| **N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈ |
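
The last two strings in the example are canonically equivalent, but the first has its accent marks in non-canonical order once decomposed. The sketch below uses the JDK's `java.text.Normalizer` (not an ICU API) to show that both reduce to the same NFD form, and to check whether input is already in NFD before deciding to enable the option. The names `CanonicalOrderDemo`, `isAlreadyNfd`, and `toNfd` are hypothetical.

```java
import java.text.Normalizer;

final class CanonicalOrderDemo {
    // True if the text is already in NFD; per the text above, such input
    // does not need the Normalization option.
    static boolean isAlreadyNfd(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFD);
    }

    // Decomposes the text and puts combining marks into canonical order.
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String diaeresisThenDot = "\u00E4\u0323"; // ä + combining dot below
        String dotThenDiaeresis = "\u1EA1\u0308"; // ạ + combining diaeresis

        // The raw code unit sequences differ.
        System.out.println(diaeresisThenDot.equals(dotThenDiaeresis));
        // After NFD normalization they are the same sequence.
        System.out.println(toNfd(diaeresisThenDot).equals(toNfd(dotThenDiaeresis)));
        // Whether each input would need normalization.
        System.out.println(isAlreadyNfd(diaeresisThenDot));
        System.out.println(isAlreadyNfd(toNfd(diaeresisThenDot)));
    }
}
```

A check like `isAlreadyNfd` over a sample of the data is one way to judge whether the comparison cost of the On setting is worth paying.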
| |
| Some **French** dictionary ordering traditions sort strings with |
| different accents from the back of the string. This attribute is automatically |
| set to On for the Canadian French locale (fr_CA). Users normally would not need |
| to explicitly set this attribute. There is a string comparison performance cost |
| when it is set On, but sort key length is unaffected. |
| |
| **Example:**\ |
| **F=X** cote < coté < côte < côté\ |
| **F=O** cote < côte < coté < côté |
| |
| Compatibility with JIS X 4061 requires the introduction of an |
| additional level to distinguish **Hiragana** and Katakana characters. If |
| compatibility with that standard is required, then this attribute is set On, and |
| the strength should be set to at least Quaternary. |
| |
| This attribute is an implementation detail of the CLDR Japanese tailoring. The |
| implementation might change to use a different mechanism to achieve the same |
| Japanese sort order. Since ICU 50, this attribute is no longer settable. |
| |
| **Example:**\ |
| **H=X, S=4** きゅう = キュウ < きゆう = キユウ\ |
| **H=O, S=4** きゅう < キュウ < きゆう < キユウ |
| |
| > :point_right: **Note** If attributes in the collator name are not overridden, |
| > they are assumed to be the same as for the given locale. |
| > For example, a collator opened with an empty |
| > string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**. |
| |
| ### Summary of Value Abbreviations |
| |
| Value | Abbreviation |
| ------------- | ------------ |
| Default | D |
| On | O |
| Off | X |
| Primary | 1 |
| Secondary | 2 |
| Tertiary | 3 |
| Quaternary | 4 |
| Identical | I |
| Shifted | S |
| Non-Ignorable | N |
| Lower-First | L |
| Upper-First | U |
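
To make the naming scheme concrete, here is a minimal sketch that splits a name such as `AN_CX_EX_FX_HX_KX_NX_S3_T0000` into its elements and expands the value abbreviations from the table above. It is not ICU's own parser. It assumes that each element is one key letter followed by a value, that values contain no underscores, and that only the keys A, C, E, F, H, N, and S take abbreviated values; other keys (such as T with its hex digits) are left as written. The class name `ShortDefinitionSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class ShortDefinitionSketch {
    // Value abbreviations, copied from the summary table.
    private static final Map<String, String> VALUE_NAMES = new HashMap<>();
    static {
        VALUE_NAMES.put("D", "Default");
        VALUE_NAMES.put("O", "On");
        VALUE_NAMES.put("X", "Off");
        VALUE_NAMES.put("1", "Primary");
        VALUE_NAMES.put("2", "Secondary");
        VALUE_NAMES.put("3", "Tertiary");
        VALUE_NAMES.put("4", "Quaternary");
        VALUE_NAMES.put("I", "Identical");
        VALUE_NAMES.put("S", "Shifted");
        VALUE_NAMES.put("N", "Non-Ignorable");
        VALUE_NAMES.put("L", "Lower-First");
        VALUE_NAMES.put("U", "Upper-First");
    }

    // Assumption: only these key letters carry abbreviated values.
    private static final String ABBREVIATED_KEYS = "ACEFHNS";

    // Splits the name on '_' and maps each key letter to its (expanded) value.
    static Map<Character, String> parse(String name) {
        Map<Character, String> settings = new LinkedHashMap<>();
        if (name.isEmpty()) {
            return settings; // nothing overridden; locale defaults apply
        }
        for (String element : name.split("_")) {
            if (element.length() < 2) {
                throw new IllegalArgumentException("Malformed element: " + element);
            }
            char key = element.charAt(0);
            String value = element.substring(1);
            if (ABBREVIATED_KEYS.indexOf(key) >= 0) {
                value = VALUE_NAMES.getOrDefault(value, value);
            }
            settings.put(key, value);
        }
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(parse("AN_CX_EX_FX_HX_KX_NX_S3_T0000"));
    }
}
```

Restricting the expansion to a fixed set of keys avoids misreading, for example, a language or region value that happens to be a single letter from the table.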