 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # Collation Concepts

 The previous section demonstrated many of the requirements imposed on string
 comparison routines that try to correctly collate strings according to
 conventions of more than a hundred different languages, written in many
 different scripts. This section describes the principles and architecture behind
 the ICU Collation Service.

 ## Sortkeys vs Comparison

 Sort keys are most useful in databases, where the overhead of calling a function
 for each comparison is very large.

 Generating a sort key from a Collator is many times more expensive than doing a
 compare with the Collator (for common use cases). That's if the two functions
 are called from Java or C. So for those languages, unless there is a very large
 number of comparisons, it is better to call the compare function.

 Here is an example, with a little back-of-the-envelope calculation. Let's
 suppose that with a given language on a given platform, the compare performance
 (CP) is 100 times faster than sortKey performance (SP), and that you are doing a
 binary search of a list with 1,000 elements. The binary comparison performance
 is BP. We'd do about 10 comparisons, getting:

 compare: 10 \* CP

 sortkey: 1 \* SP + 10 \* BP

 Even if BP is free, compare would be better. One has to get up to where log2(n)
 = 100 before they break even.

 But even this calculation is only a rough guide. First, the binary comparison is
 not completely free. Secondly, the performance of the compare function varies
 radically with the source data. We optimized for maximizing performance of
 collation in sorting and binary search, so comparing strings that are "close" is
 optimized to be much faster than comparing strings that are "far away". That
 optimization is important because normal sort/lookup operations compare close
 strings far more often -- think of binary search, where the last few comparisons
 are always with the closest strings. So even the above calculation is not very
 accurate.
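
The estimate above can be written as a small cost model. This is only a sketch: the unit costs `cp`, `sp` and `bp` are the hypothetical figures from the text (not measurements of ICU), and it assumes the sort keys of the list elements are already stored, so only the key for the search term has to be generated.

```python
import math

def lookup_costs(n, cp=1.0, sp=100.0, bp=0.0):
    """Return (compare_cost, sortkey_cost) for one binary search over n items.

    cp: cost of one Collator compare (hypothetical unit)
    sp: cost of generating one sort key (hypothetical, 100 times cp)
    bp: cost of one binary comparison of two sort keys (hypothetical)
    """
    steps = math.ceil(math.log2(n))   # about 10 comparisons for 1,000 elements
    return steps * cp, sp + steps * bp

compare_cost, sortkey_cost = lookup_costs(1000)
```

With these figures the two costs only meet when `steps` reaches 100, which is the break-even point mentioned above.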

 ## Comparison Levels

 In general, when comparing and sorting objects, some properties can take
 precedence over others. For example, in geometry, you might consider first the
 number of sides a shape has, followed by the number of sides of equal length.
 This causes triangles to be sorted together, then rectangles, then pentagons,
 etc. Within each category, the shapes would be ordered according to whether they
 had 0, 2, 3 or more sides of the same length. However, this is not the only way
 the shapes can be sorted. For example, it might be preferable to sort shapes by
 color first, so that all red shapes are grouped together, then blue, etc.
 Another approach would be to sort the shapes by the amount of area they enclose.

 Similarly, character strings have properties, some of which can take precedence
 over others. There is more than one way to prioritize the properties.

 For example, a common approach is to distinguish characters first by their
 unadorned base letter (for example, without accents, vowels or tone marks), then
 by accents, and then by the case of the letter (upper vs. lower). Ideographic
 characters might be sorted by their component radicals and then by the number of
 strokes it takes to draw the character.
 An alternative ordering would be to sort these characters by strokes first and
 then by their radicals.

 The ICU Collation Service supports many levels of comparison (named "Levels",
 but also known as "Strengths"). Having these categories enables ICU to sort
 strings precisely according to local conventions. However, by allowing the
 levels to be selectively employed, searching for a string in text can be
 performed with various matching conditions.

 Performance optimizations have been made for ICU collation with the default
 level settings. Performance specific impacts are discussed in the Performance
 section below.

 Following is a list of the names for each level and an example usage:

 1.  Primary Level: Typically, this is used to denote differences between base
     characters (for example, "a" < "b"). It is the strongest difference. For
     example, dictionaries are divided into different sections by base character.
     This is also called the level-1 strength.

 2.  Secondary Level: Accents in the characters are considered secondary
     differences (for example, "as" < "às" < "at"). Other differences between
     letters can also be considered secondary differences, depending on the
     language. A secondary difference is ignored when there is a primary
     difference anywhere in the strings. This is also called the level-2
     strength.
     Note: In some languages (such as Danish), certain accented letters are
     considered to be separate base characters. In most languages, however, an
     accented letter only has a secondary difference from the unaccented version
     of that letter.

 3.  Tertiary Level: Upper and lower case differences in characters are
     distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
     addition, a variant of a letter differs from the base form on the tertiary
     level (such as "A" and "Ⓐ"). Another example is the difference between large
     and small Kana. A tertiary difference is ignored when there is a primary or
     secondary difference anywhere in the strings. This is also called the
     level-3 strength.

 4.  Quaternary Level: When punctuation is ignored (see
     [Ignoring Punctuation](#ignoring-punctuation)) at levels 1-3, an additional
     level can be used to distinguish words with and without punctuation (for
     example, "ab" < "a-b" < "aB"). This difference is ignored when there is a
     primary, secondary or tertiary difference. This is also known as the level-4
     strength. The quaternary level should only be used if ignoring punctuation
     is required or when processing Japanese text (see
     [Hiragana Processing](#hiragana-processing)).

 5.  Identical Level: When all other levels are equal, the identical level is
     used as a tiebreaker. The Unicode code point values of the NFD form of each
     string are compared at this level, just in case there is no difference at
     levels 1-4. For example, Hebrew cantillation marks are only distinguished
     at this level. This level should be used sparingly, because strings that
     differ only in their code point values are extremely rare.
     Using this level substantially decreases the performance for
     both incremental comparison and sort key generation (as well as increasing
     the sort key length). It is also known as the level-5 strength.
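
The first three levels can be illustrated with a toy model. The sketch below is not ICU code: `toy_sort_key` is a hypothetical helper that uses Python's `unicodedata` module to split each character into a base letter (primary), its accents (secondary) and its case (tertiary), and it ignores everything else a real collator handles.

```python
import unicodedata

def toy_sort_key(text, strength=3):
    """Build a toy (primary, secondary, tertiary) key, cut off at `strength` levels."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and primary:
            secondary[-1] += ch            # accent belongs to the preceding base letter
        else:
            primary.append(ch.lower())     # base letter with case removed
            secondary.append("")           # no accent seen yet
            tertiary.append(ch.isupper())  # False (lowercase) sorts before True
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    return levels[:strength]

# "as" < "às" < "at": the accent matters only because the base letters tie.
secondary_example = sorted(["at", "\u00e0s", "as"], key=toy_sort_key)
# "ao" < "Ao" < "aò": case matters only when base letters and accents tie.
tertiary_example = sorted(["a\u00f2", "Ao", "ao"], key=toy_sort_key)
```

Because the levels are compared as a tuple, a difference on a later level is only consulted when all earlier levels are equal, which mirrors the precedence described in the list above.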

 ## Backward Secondary Sorting

 Some languages require words to be ordered on the secondary level according to
 the *last* accent difference, as opposed to the *first* accent difference. This
 was previously the default for all French locales, based on some French
 dictionary ordering traditions, but is currently only applicable to Canadian
 French (locale **fr_CA**), for conformance with the [Canadian sorting
 standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
 ordering is only noticeable for a small number of pairs of real words. For more
 information see [UCA: Contextual
 Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).

 Example:

 Forward secondary | Backward secondary
 ----------------- | ------------------
 cote              | cote
 coté              | côte
 côte              | coté
 côté              | côté
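
The table can be reproduced with a toy model in which the secondary (accent) weights are compared from the end of the word instead of the beginning. This is a sketch under that assumption, not the ICU implementation; `accent_key` is a hypothetical helper.

```python
import unicodedata

def accent_key(word, backward=False):
    """Toy key: base letters first, then accents (optionally in reverse order)."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and base:
            accents[-1] += ch      # attach the accent to its base letter
        else:
            base.append(ch)
            accents.append("")
    if backward:
        accents.reverse()          # the last accent difference now decides first
    return tuple(base), tuple(accents)

words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]   # côté, coté, côte, cote
forward = sorted(words, key=accent_key)
backward = sorted(words, key=lambda w: accent_key(w, backward=True))
```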

 ## Contractions

 A contraction is a sequence consisting of two or more letters. It is considered
 a single letter in sorting.

 For example, in the traditional Spanish sorting order, "ch" is considered a
 single letter. All words that begin with "ch" sort after all other words
 beginning with "c", but before words starting with "d".

 Other examples of contractions are "ch" in Czech, which sorts after "h", and
 "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
 respectively.

 Example:

 Order without contraction | Order with contraction "lj" sorting after letter "l"
 ------------------------- | ----------------------------------------------------
 la                        | la
 li                        | li
 lj                        | lk
 lja                       | lz
 ljz                       | lj
 lk                        | lja
 lz                        | ljz
 ma                        | ma

 Contracting sequences such as the above are not very common in most languages.
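
The effect of a contraction can be sketched with a toy alphabet in which "lj" and "nj" are single elements placed after "l" and "n". The names `ELEMENTS`, `RANK` and `contraction_key` are hypothetical, and the model only covers lowercase ASCII words.

```python
# Toy alphabet: "lj" and "nj" are single collation elements after "l" and "n".
ELEMENTS = list("abcdefghijkl") + ["lj"] + list("mn") + ["nj"] + list("opqrstuvwxyz")
RANK = {element: index for index, element in enumerate(ELEMENTS)}

def contraction_key(word):
    """Map a word to a list of ranks, matching two-letter contractions first."""
    ranks, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:   # longest match wins
            ranks.append(RANK[pair])
            i += 2
        else:
            ranks.append(RANK[word[i]])
            i += 1
    return ranks

words = ["la", "li", "lj", "lja", "ljz", "lk", "lz", "ma"]
without_contraction = sorted(words)
with_contraction = sorted(words, key=contraction_key)
```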

 > :point_right: **Note** Since ICU 2.2, and as required by the UCA,
 > if a completely ignorable code point
 > appears in text in the middle of contraction, it will not break the contraction.
 > For example, in Czech sorting, cU+0000h will sort as if it were ch.

 ## Expansions

 If a letter sorts as if it were a sequence of more than one letter, it is called
 an expansion.

 For example, in German phonebook sorting (de@collation=phonebook or BCP 47
 de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae".
 All words starting with "ä" will sort between words starting with "ad" and words
 starting with "af".
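
A minimal sketch of this expansion, assuming precomposed input and looking only at the primary level. `EXPANSIONS` and `phonebook_primary` are hypothetical names, and only the "ä" mapping mentioned above is modelled.

```python
EXPANSIONS = {"\u00e4": "ae"}   # "ä" expands to the two letters "ae"

def phonebook_primary(word):
    """Toy primary key: lowercase the word and apply the expansion table."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word.lower())

ordered = sorted(["af", "\u00e4hnlich", "ad"], key=phonebook_primary)
```

The word starting with "ä" lands between "ad" and "af" because its key starts with "ae".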

 In the case of Unicode encoding, characters can often be represented either as
 pre-composed characters or in decomposed form. For example, the letter "à" can
 be represented in its decomposed (a+\`) and pre-composed (à) form. Most
 applications do not want to distinguish text by the way it is encoded. A search
 for "à" should find all instances of the letter, regardless of whether the
 instance is in pre-composed or decomposed form. Therefore, either form of the
 letter must result in the same sort ordering. The architecture of the ICU
 Collation Service supports this.
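
The two encodings of "à" can be inspected with Python's standard `unicodedata` module. This only illustrates canonical equivalence; it says nothing about how ICU handles it internally, and `canonical_form` is a hypothetical helper.

```python
import unicodedata

precomposed = "\u00e0"     # à as a single code point
decomposed = "a\u0300"     # a followed by COMBINING GRAVE ACCENT

def canonical_form(text):
    """Return the canonical decomposition (NFD) of the text."""
    return unicodedata.normalize("NFD", text)

# The raw strings differ, but their canonical forms are the same.
same_after_normalizing = canonical_form(precomposed) == canonical_form(decomposed)
```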

 ## Contractions Producing Expansions

 It is possible to have contractions that produce expansions.

 One example occurs in Japanese, where the vowel with a prolonged sound mark is
 treated as equivalent to the long vowel version:

 カアー<<< カイー and\
 キイー<<< キイー

 > :point_right: **Note** Since ICU 2.0 Japanese tailoring uses
 > [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
 > instead of contraction producing expansions.

 ## Normalization

 In the section on expansions, we discussed that text in Unicode can often be
 represented in either pre-composed or decomposed forms. There are other types of
 equivalences possible with Unicode, including Canonical and Compatibility. The
 process of
 Normalization ensures that text is written in a predictable way so that searches
 are not made unnecessarily complicated by having to match on equivalences. Not
 all text is normalized, however, so it is useful to have a collation service
 that can address text that is not normalized, but do so with efficiency.

 The ICU Collation Service handles un-normalized text properly, producing the
 same results as if the text were normalized.

 In practice, most data that is encountered is in normalized or semi-normalized
 form already. The ICU Collation Service is designed so that it can process a
 wide range of normalized or un-normalized text without a need for normalization
 processing. When a case is encountered that requires normalization, the ICU
 Collation Service drops into code specific to this purpose. This maximizes
 performance for the majority of text that does not require normalization.

 In addition, if the text is known with certainty not to contain un-normalized
 text, then even the overhead of checking for normalization can be eliminated.
 The ICU Collation Service has the ability to turn Normalization Checking either
 on or off. If Normalization Checking is turned off, it is the user's
 responsibility to ensure that all text is already in the appropriate form. This
 is the case for the great majority of the world's languages, so normalization
 checking is turned off by default for most locales.

 If the text requires normalization processing, Normalization Checking should be
 on. Any language that uses multiple combining characters such as Arabic, ancient
 Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
 to be on, or the text to go through a normalization process before collation.
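
The kind of text that needs this processing can be shown with two combining marks stored in non-canonical order. The sketch uses Python's `unicodedata` (the `is_normalized` function requires Python 3.8 or later) as a stand-in for a normalization check; `needs_normalization` is a hypothetical helper, not an ICU function.

```python
import unicodedata

def needs_normalization(text):
    """True if the text is not already in NFD."""
    return not unicodedata.is_normalized("NFD", text)

# Acute accent (combining class 230) stored before dot below (class 220).
unordered = "a\u0301\u0323"
# Canonical ordering puts the mark with the lower combining class first.
reordered = unicodedata.normalize("NFD", unordered)
```

Plain ASCII text passes the check unchanged, which is why skipping the check is safe for data known to be normalized.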

 For more information about normalization-related reordering, please see
 [Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
 [UAX #15](http://www.unicode.org/reports/tr15/).

 > :point_right: **Note** ICU supports two modes of normalization: on and off.
 > The java.text.\* classes offer a compatibility decomposition mode, which is not supported in ICU.

 ## Ignoring Punctuation

 In some cases, punctuation can be ignored while searching or sorting data. For
 example, this enables a search for "biweekly" to also return instances of
 "bi-weekly". In other cases, it is desirable for punctuated text to be
 distinguished from text without punctuation, but to have the text sort close
 together.

 These two behaviors can be accomplished if there is a way for a character to be
 ignored on all levels except for the quaternary level. If this is the case, then
 two strings which compare as identical on the first three levels (base letter,
 accents, and case) are then distinguished at the fourth level based on their
 punctuation (if any). If the comparison function ignores differences at the
 fourth level, then strings that differ by punctuation only are compared as
 equal.

 The following table shows the results of sorting a list of terms in 3 different
 ways. In the first column, punctuation characters (space " ", and hyphen "-")
 are not ignored (" " < "-" < "b"). In the second column, punctuation characters
 are ignored in the first 3 levels and compared only in the fourth level. In the
 third column, punctuation characters are ignored in the first 3 levels and the
 fourth level is not considered, so punctuated terms are equivalent to the
 identical terms without punctuation.

 For more options and details see the [“Ignore Punctuation”
 Options](customization/ignorepunct.md) page.

 Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
 ------------- | --------------------------------- | -------------------------------
 black bird    | black bird                        | **black bird**
 black Bird    | black-bird                        | **black-bird**
 black birds   | blackbird                         | **blackbird**
 black-bird    | black Bird                        | black Bird
 black-Bird    | black-Bird                        | black-Bird
 black-birds   | blackBird                         | blackBird
 blackbird     | black birds                       | black birds
 blackBird     | black-birds                       | black-birds
 blackbirds    | blackbirds                        | blackbirds

 > :point_right: **Note** The strings with the same font format in the last column are
 > compared as equal by ICU Collator.\
 > Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
 > follow shifted code points will be completely ignored. This means that an accent
 > following a space will compare as if it were a space alone.
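
The second and third columns of the table can be reproduced with a toy model of ignorable punctuation. In this sketch (not ICU code) space and hyphen are removed from the first levels and only contribute a fourth-level weight, while every other character gets one high fourth-level weight. The weights in `VARIABLE` and the constant `HIGH` are assumptions, and the secondary level is left out because the terms have no accents.

```python
VARIABLE = {" ": 1, "-": 2}   # toy weights: space sorts before hyphen
HIGH = 0xFFFF                 # fourth-level weight of every other character

def shifted_key(text, use_quaternary=True):
    """Toy key with punctuation ignored on the first levels."""
    letters = [ch for ch in text if ch not in VARIABLE]
    primary = tuple(ch.lower() for ch in letters)
    tertiary = tuple(ch.isupper() for ch in letters)
    if not use_quaternary:
        return primary, tertiary
    quaternary = tuple(VARIABLE.get(ch, HIGH) for ch in text)
    return primary, tertiary, quaternary

terms = ["black bird", "black Bird", "black birds", "black-bird", "black-Bird",
         "black-birds", "blackbird", "blackBird", "blackbirds"]
quaternary_order = sorted(terms, key=shifted_key)
```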

 ## Case Ordering

 The tertiary level is used to distinguish text by case, by small versus large
 Kana, and other letter variants as noted above.

 Some applications prefer to emphasize case differences so that words starting
 with the same case sort together. Some Japanese applications require the
 difference between small and large Kana be emphasized over other tertiary
 differences.

 The UCA does not provide means to separate out either case or Kana differences
 from the remaining tertiary differences. However, the ICU Collation Service has
 two options that help customize case and/or Kana differences. Both options
 are turned off by default.

 ### CaseFirst

 The Case-first option makes case the most significant part of the tertiary
 level. Primary and secondary levels are unaffected. With this option, words
 starting with the same case sort together. The Case-first option can be set to
 make either lowercase sort before
 uppercase or uppercase sort before lowercase.

 Note: The case-first option does not constitute a separate level; it is simply a
 reordering of the tertiary level.

 ICU makes use of the following three case categories for sorting:

 1.  uppercase: "ABC"

 2.  mixed case: "Abc", "aBc"

 3.  normal (lowercase or no case): "abc", "123"

 Mixed case is always sorted between uppercase and normal case when the
 "case-first" option is set.

 ### CaseLevel

 The Case Level option makes a separate level for case differences. This is an
 extra level positioned between secondary and tertiary. The case level is used in
 Japanese to make the difference between small and large Kana more important than
 the other tertiary differences. It also can be used to ignore other tertiary
 differences, or even secondary differences. This is especially useful in
 matching. For example, if the strength is set to primary only (level-1) and the
 case level is turned on, the comparison ignores accents and tertiary differences
 except for case. The contents of the case level are affected by the case-first
 option.

 The case level is independent from the strength of comparison. It is possible to
 have a collator set to primary strength with the case level turned on. This
 provides for comparison that takes into account the case differences, while at
 the same time ignoring accents and tertiary differences other than case. This
 may be used in searching.

 Example:

 **Case-first off, Case level off**

 apple\
 ⓐⓟⓟⓛⓔ\
 Abernathy\
 ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
 ähnlich\
 Ähnlichkeit

 **Lowercase-first, Case level off**

 apple\
 ⓐⓟⓟⓛⓔ\
 ähnlich\
 Abernathy\
 ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
 Ähnlichkeit

 **Uppercase-first, Case level off**

 Abernathy\
 ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
 Ähnlichkeit\
 apple\
 ⓐⓟⓟⓛⓔ\
 ähnlich

 **Lowercase-first, Case level on**

 apple\
 Abernathy\
 ⓐⓟⓟⓛⓔ\
 ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
 ähnlich\
 Ähnlichkeit

 **Uppercase-first, Case level on**

 Abernathy\
 apple\
 ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
 ⓐⓟⓟⓛⓔ\
 Ähnlichkeit\
 ähnlich

 ## Script Reordering

 Script reordering allows scripts and some other groups of characters to be moved
 relative to each other. This reordering is done on top of the DUCET/CLDR
 standard collation order. Reordering can specify groups to be placed at the
 start and/or the end of the collation order.

 By default, reordering codes specified for the start of the order are placed in
 the order given after several special non-script blocks. These special groups of
 characters are space, punctuation, symbol, currency, and digit. Script groups
 can be intermingled with these special non-script groups if those special groups
 are explicitly specified in the reordering.

 The special code `others` stands for any script that is not explicitly mentioned
 in the list. Anything that is after others will go at the very end of the list
 in the order given. For example, `[Grek, others, Latn]` will result in an
 ordering that puts all scripts other than Greek and Latin between them.

 ### Examples:

 Note: All examples below use the string equivalents for the scripts and reorder
 codes that would be used in collator rules. The script and reorder code
 constants that would be used in API calls will be different.

 **Example 1:**\
 set reorder code - `[Grek]`\
 result - `[space, punctuation, symbol, currency, digit, Grek, others]`

 **Example 2:**\
 set reorder code - `[Grek]`\
 result - `[space, punctuation, symbol, currency, digit, Grek, others]`

 followed by: set reorder code - `[Hani]`\
 result - `[space, punctuation, symbol, currency, digit, Hani, others]`

 That is, setting a reordering always modifies
 the DUCET/CLDR order, replacing whatever was previously set, rather than adding
 on to it. In order to cumulatively modify an ordering, you have to retrieve the
 existing ordering, modify it, and then set it.

 **Example 3:**\
 set reorder code - `[others, digit]`\
 result - `[space, punctuation, symbol, currency, others, digit]`

 **Example 4:**\
 set reorder code - `[space, Grek, punctuation]`\
 result - `[symbol, currency, digit, space, Grek, punctuation, others]`

 **Example 5:**\
 set reorder code - `[Grek, others, Hani]`\
 result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

 **Example 6:**\
 set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
 result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

 followed by:\
 set reorder code - `[NONE]`\
 result - DUCET/CLDR

 **Example 7:**\
 set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
 result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

 followed by:\
 set reorder code - `[DEFAULT]`\
 result - original reordering for the locale which may or may not be DUCET/CLDR

 **Example 8:**\
 set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
 result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

 followed by:\
 set reorder code - `[]`\
 result - original reordering for the locale which may or may not be DUCET/CLDR

 **Example 9:**\
 set reorder code - `[Hebr, Phnx]`\
 result - error
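
The results listed in Examples 1 and 3 to 6 follow a simple pattern, sketched below as a toy function. `resolve_reordering` is a hypothetical helper that works on the rule-string names used in the examples. It is not an ICU API and does not model `NONE`, `DEFAULT`, the empty list, or the error case of Example 9.

```python
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]

def resolve_reordering(codes):
    """Return the resulting group order for a list of reorder codes."""
    if "others" in codes:
        split = codes.index("others")
        before, after = codes[:split], codes[split + 1:]
    else:
        before, after = list(codes), []
    # Special groups that are not mentioned keep their place at the start.
    leading = [group for group in SPECIAL_GROUPS if group not in codes]
    return leading + before + ["others"] + after

example_4 = resolve_reordering(["space", "Grek", "punctuation"])
```

Each call starts again from the default order, which matches the statement that setting a reordering replaces whatever was previously set.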

 Beginning with ICU 55, scripts only reorder together if they are primary-equal,
 for example Hiragana and Katakana.

 ICU 4.8-54:

 *   Scripts were reordered in groups, each normally starting with a [Recommended
     Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
 *   Reorder codes moved as a group (were “equivalent”) if their scripts shared a
     primary-weight lead byte.
 *   For example, Hebr and Phnx were “equivalent” reordering codes and were
     reordered together. Their order relative to each other could not be changed.
 *   Only one code from any group could be reordered, not several codes from the
     same group.

 ## Sorting of Japanese Text (JIS X 4061)

 Japanese standard JIS X 4061 requires two changes to the collation procedures:
 special processing of Hiragana characters and (for performance reasons) prefix
 analysis of text.

 ### Hiragana Processing

 JIS X 4061 standard requires more levels than provided by the UCA. To offer
 conformant sorting order, ICU uses the quaternary level to distinguish between
 Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
 symbols on quaternary level, thus causing Hiragana sequences to sort before
 corresponding Katakana sequences.
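
As a rough illustration, the sketch below gives a Katakana character the same primary value as its Hiragana counterpart and separates the two with a later weight, standing in for the quaternary level. It relies on the fact that Katakana U+30A1 to U+30F6 sit 0x60 code points above the corresponding Hiragana. Using code points as primary weights is a simplifying assumption, and `kana_key` is a hypothetical helper.

```python
def kana_key(text):
    """Toy key: shared primary values, then a Hiragana-before-Katakana weight."""
    primary, quaternary = [], []
    for ch in text:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:      # Katakana with a Hiragana counterpart
            primary.append(code - 0x60)   # fold onto the Hiragana code point
            quaternary.append(1)          # Katakana gets the larger weight
        else:
            primary.append(code)
            quaternary.append(0)
    return primary, quaternary

# カキ (Katakana), かき (Hiragana), かく (Hiragana)
ordered = sorted(["\u30ab\u30ad", "\u304b\u304d", "\u304b\u304f"], key=kana_key)
```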

 ### Prefix Analysis

 Another characteristic of sorting according to JIS X 4061 is a large number
 of contractions followed by expansions (see
 [Contractions Producing Expansions](#contractions-producing-expansions)).
 This causes all the Hiragana and Katakana codepoints to be treated as
 contractions, which reduces performance. The solution we adopted introduces the
 prefix concept which allows us to improve the performance of Japanese sorting.
 More about this can be found in the [customization
 chapter](customization/index.md).

 ## Thai/Lao reordering

 UCA requires that certain Thai and Lao prevowels be reordered with a code point
 following them. This option is always on in the ICU implementation, as
 prescribed by the UCA.

 This rule takes effect when:

 1.  A Thai vowel in the range U+0E40-U+0E44 precedes a Thai consonant in the
     range U+0E01-U+0E2E, or

 2.  A Lao vowel in the range U+0EC0-U+0EC4 precedes a Lao consonant in the
     range U+0E81-U+0EAE.

 In these cases the vowel is placed after the consonant for collation purposes.
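
The rule can be sketched as a preprocessing step that swaps each prevowel with the consonant that follows it. This is an illustration of the stated ranges only; `reorder_prevowels` is a hypothetical helper, and ICU does not necessarily implement the reordering as a separate pass.

```python
def reorder_prevowels(text):
    """Swap Thai/Lao prevowels with the consonant that follows them."""
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        vowel, consonant = ord(chars[i]), ord(chars[i + 1])
        thai = 0x0E40 <= vowel <= 0x0E44 and 0x0E01 <= consonant <= 0x0E2E
        lao = 0x0EC0 <= vowel <= 0x0EC4 and 0x0E81 <= consonant <= 0x0EAE
        if thai or lao:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2      # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)

thai_swapped = reorder_prevowels("\u0e40\u0e01")   # vowel U+0E40, consonant U+0E01
```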

 > :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai
 > reordering. Java.text.\* classes allow tailorings to turn off reordering by
 > using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
 > prevowels.

 ## Space Padding

 In many database products, fields are padded with null. To get correct results,
 the input to a Collator should omit any superfluous trailing padding spaces. The
 problem arises with contractions, expansions, or normalization. Suppose that
 there are two fields, one containing "aed" and the other with "äd". German
 phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
 compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
 "aed". But if both fields are padded with spaces to a length of 4, then this
 will reverse the order, since the second will compare as if it were one character
 longer. In other words, when you start with strings 1 and 2

 1  | a  | e  | d         | \<space\>
 -- | -- | -- | --------- | ---------
 2  | ä  | d  | \<space\> | \<space\>

 they end up being compared on a primary level as if they were 1' and 2'

 1' | a  | e  | d  | \<space\> | &nbsp;
 -- | -- | -- | -- | --------- | ---------
 2' | a  | e  | d  | \<space\> | \<space\>

 Since 2' has an extra character (the extra space), it counts as having a primary
 difference when it shouldn't. The correct result occurs when the trailing
 padding spaces are removed, as in 1" and 2"

 1" | a  | e  | d
 -- | -- | -- | --
 2" | a  | e  | d
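
The same effect can be seen in a toy primary-level model where "ä" expands to "ae". `primary_key` is a hypothetical helper that performs only that expansion, and padding each field to four characters is an assumption that matches the first table.

```python
def primary_key(field):
    """Toy primary key: expand "ä" to "ae" and keep everything else."""
    return field.replace("\u00e4", "ae")

padded_1 = "aed".ljust(4)        # "aed" plus one padding space
padded_2 = "\u00e4d".ljust(4)    # "äd" plus two padding spaces

# With padding, the second key is one space longer than the first.
differs_when_padded = primary_key(padded_1) != primary_key(padded_2)
# After trimming the padding, both fields have the same primary key.
equal_when_trimmed = primary_key(padded_1.rstrip(" ")) == primary_key(padded_2.rstrip(" "))
```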

 ## Collator naming scheme

 ***Starting with ICU 54, the following naming scheme and its API functions are
 deprecated.*** Use ucol_open() with language tag collation keywords instead (see
 [Collation API Details](api.md)). For example,
 ucol_open("de-u-co-phonebk-ka-shifted", &errorCode) for German Phonebook order
 with "ignore punctuation" mode.

 When collating or matching text, a number of attributes can be used to affect
 the desired result. The following describes the attributes, their values, their
 effects, their normal usage, and the string comparison performance and sort key
 length implications. It also includes single-letter abbreviations for both the
 attributes and their values. These abbreviations allow a 'short-form'
 specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
 can be used to specify that the desired options are: UCA version 4.0.0; ignore
 spaces, punctuation and symbols; use Swedish linguistic conventions; compare
 case-insensitively.

 A number of attribute values are common across different attributes; these
 include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
 otherwise stated, the examples use the UCA alone with default settings.

 > :point_right: **Note** In order to achieve uniqueness, a collator name always
 > has the attribute abbreviations sorted.

 ### Main References

 1.  For a full list of supported locales in ICU, see [Locale
     Explorer](http://demo.icu-project.org/icu-bin/locexp), which also contains
     an on-line demo showing sorting for each locale. The demo allows you to try
     different attribute values, to see how they affect sorting.

 2.  To see tabular results for the UCA table itself, see the [Unicode Collation
     Charts](http://www.unicode.org/charts/collation/).

 3.  For the UCA specification, see [UTS #10: Unicode Collation
     Algorithm](http://www.unicode.org/reports/tr10/).

 4.  For more detail on the precise effects of these options, see [Collation
     Customization](customization/index.md).

 #### Collator Naming Attributes

 Attribute              | Abbreviation | Possible Values
 ---------------------- | ------------ | ---------------
 Locale                 | L            | \<language\>
 Script                 | Z            | \<script\>
 Region                 | R            | \<region\>
 Variant                | V            | \<variant\>
 Keyword                | K            | \<keyword\>
 &nbsp;                 | &nbsp;       | &nbsp;
 Strength               | S            | 1, 2, 3, 4, I, D
 Case_Level             | E            | X, O, D
 Case_First             | C            | X, L, U, D
 Alternate              | A            | N, S, D
 Variable_Top           | T            | \<hex digits\>
 Normalization Checking | N            | X, O, D
 French                 | F            | X, O, D
 Hiragana               | H            | X, O, D

 #### Collator Naming Attribute Descriptions

 The **Locale** attribute is typically the most
 important attribute for correct sorting and matching, according to the user
 expectations in different countries and regions. The default UCA ordering will
 only sort a few languages such as Dutch and Portuguese correctly ("correctly"
 meaning according to the normal expectations for users of the languages).
 Otherwise, you need to supply the locale to UCA in order to properly collate
 text for a given language. Thus a locale needs to be supplied so as to choose a
 collator that is correctly **tailored** for that locale. The choice of a locale
 will automatically preset the values for all of the attributes to something that
 is reasonable for that locale. Thus most of the time the other attributes do not
 need to be explicitly set. In some cases, the choice of locale will make a
 difference in string comparison performance and/or sort key length.

 In short attribute names,
 `<language>_<script>_<region>_<variant>@collation=<keyword>` is
 represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
 all the elements are required. Valid values for locale elements are general
 valid values for RFC 3066 locale naming.

 **Example:**\
 **Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
 **Locale="de" (German)** "Köpfe" < "Kypper"

 The **Strength** attribute determines whether accents or
 case are taken into account when collating or matching text. (In writing
 systems without case or accents, it controls similarly important features.) The
 default strength setting usually does not need to be changed for collating
 (sorting), but often needs to be changed when **matching** (e.g. SELECT). The
 possible values include Default (D), Primary (1), Secondary (2), Tertiary (3),
 Quaternary (4), and Identical (I).

 For example, people may choose to ignore accents or ignore accents and case when
 searching for text.

 Almost all characters are distinguished by the first three levels, and in most
 locales the default value is thus Tertiary. However, if Alternate is set to be
 Shifted, then the Quaternary strength (4) can be used to break ties among
 whitespace, punctuation, and symbols that would otherwise be ignored. If very
 fine distinctions among characters are required, then the Identical strength (I)
 can be used. For example, Identical strength distinguishes between the
 **Mathematical Bold Small A** and the **Mathematical Italic Small A**. For more
 examples, look at the cells with white backgrounds in the collation charts.
 However, using levels higher than Tertiary (the Identical strength) results in
 significantly longer sort keys, and slower string comparison performance for
 equal strings.

 **Example:**\
 **S=1** role = Role = rôle\
 **S=2** role = Role < rôle\
 **S=3** role < Role < rôle

 The **Case_Level** attribute is used when ignoring accents
 **but not** case. In such a situation, set Strength to be Primary, and
 Case_Level to be On. In most locales, this setting is Off by default. There is a
 small string comparison performance and sort key impact if this attribute is set
 to be On.

 **Example:**\
 **S=1, E=X** role = Role = rôle\
 **S=1, E=O** role = rôle < Role
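
The example can be mimicked with a toy key that drops accents and, when the case level is on, appends a case weight after the primary level. This is a sketch, not ICU behavior in full; `primary_with_case_level` is a hypothetical helper built on Python's `unicodedata`.

```python
import unicodedata

def primary_with_case_level(text, case_level=True):
    """Toy key for primary strength, optionally followed by a case level."""
    letters = [ch for ch in unicodedata.normalize("NFD", text)
               if not unicodedata.combining(ch)]      # accents are ignored
    primary = tuple(ch.lower() for ch in letters)
    if not case_level:
        return (primary,)
    return primary, tuple(ch.isupper() for ch in letters)

role, Role, role_accented = "role", "Role", "r\u00f4le"
case_level_on = sorted([Role, role_accented, role], key=primary_with_case_level)
```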

 The **Case_First** attribute is used to control whether
 uppercase letters come before lowercase letters or vice versa, in the absence of
 other differences in the strings. The possible values are Uppercase_First (U)
 and Lowercase_First (L), plus the standard Default and Off. There is almost no
 difference between the Off and Lowercase_First options in terms of results, so
 typically users will not use Lowercase_First: only Off or Uppercase_First.
 (People interested in the detailed differences between X and L should consult
 the [Collation Customization](customization/index.md) chapter.)
 Specifying either L or U won't affect string comparison performance, but will
 affect the sort key length.

 **Example:**\
 **C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
 **C=U** "China" < "china" < "Denmark" < "denmark"

 The **Alternate** attribute is used to control the handling of
 the so-called **variable** characters in the UCA: whitespace, punctuation and
 symbols. If Alternate is set to Non-Ignorable (N), then differences among these
 characters are of the same importance as differences among letters. If Alternate
 is set to Shifted (S), then these characters are of only minor importance. The
 Shifted value is often used in combination with Strength set to Quaternary. In
 such a case, whitespace, punctuation, and symbols are considered when comparing
 strings, but only if all other aspects of the strings (base letters, accents,
 and case) are identical. If Alternate is not set to Shifted, then there is no
 difference between a Strength of 3 and a Strength of 4.

 For more information and examples, see
 [Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
 the UCA.

 The reason the Alternate values are not simply On and Off is that
 additional Alternate values may be added in the future.

 The UCA option
 **Blanked** is expressed with Strength set to 3, and Alternate set to Shifted.

 The default for most locales is Non-Ignorable. If Shifted is selected, string
 comparison may be slower if there are many strings that are the same except for
 punctuation; sort key length will not be affected unless the strength level is
 also increased.

 **Example:**\
 **S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
 **S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
 **S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA
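
The three orderings above can be reproduced with a toy model of Non-Ignorable versus Shifted handling. The sketch below is not ICU's implementation. It assumes that every non-alphanumeric character is variable, uses code points as stand-in weights (adequate only for these ASCII samples), and omits the secondary level because the samples contain no accents. `HIGH`, `is_variable`, and `sort_key` are hypothetical names.

```python
HIGH = 0x110000  # toy quaternary weight given to every non-variable character

def is_variable(ch):
    """Toy definition of variable: whitespace, punctuation, and symbols."""
    return not ch.isalnum()

def sort_key(text, strength=3, shifted=False):
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if shifted and is_variable(ch):
            # Shifted: ignored on levels 1-3, visible only on the quaternary level.
            quaternary.append(ord(ch))
        else:
            primary.append(ch.casefold())
            tertiary.append(1 if ch.isupper() else 0)
            quaternary.append(HIGH)
    key = (primary, tertiary)
    if strength >= 4:
        key += (quaternary,)
    return key

names = ["USA", "diSilva", "U.S.A.", "Di Silva", "di Silva"]
for strength, shifted in ((3, False), (3, True), (4, True)):
    print(strength, shifted,
          sorted(names, key=lambda n: sort_key(n, strength, shifted)))
```

In this model, a quaternary level without Shifted consists only of `HIGH` weights, so it never changes the order. That mirrors the statement that Strength 3 and Strength 4 do not differ unless Alternate is Shifted.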

 The **Variable_Top** attribute is only meaningful if the
 Alternate attribute is not set to Non-Ignorable. In such a case, it controls
 which characters count as ignorable. The \<hex\> value specifies the character
 sequence with the "highest" weight (in UCA order) that is to be considered
 ignorable.

 Thus, for example, if a user wanted whitespace to be ignorable, but not any
 visible characters, then they would use the value Variable_Top=0020 (space). The
 hex digits should specify only a single character. All characters of the same
 primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the
 same effect as Variable_Top=0020.

 This setting (alone) has little impact on string comparison performance; setting
 it lower or higher will make sort keys slightly shorter or longer respectively.

 **Example:**\
 **S=3, A=S** di Silva = diSilva < U.S.A. = USA\
 **S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA
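
A minimal sketch of the cutoff idea, assuming Alternate is Shifted and Strength is Tertiary: a variable character is ignored only if its weight does not exceed the Variable_Top value. The sketch uses code points as a stand-in for UCA weights, which is an assumption that happens to work for the space and full stop in these samples but does not hold in general. `sort_key` and `variable_top` are hypothetical names.

```python
def sort_key(text, variable_top=None):
    """Toy Shifted key at Strength 3; variable_top is a code point or None."""
    primary, tertiary = [], []
    for ch in text:
        variable = not ch.isalnum()  # toy definition of a variable character
        if variable and (variable_top is None or ord(ch) <= variable_top):
            continue  # ignorable: contributes nothing at levels 1-3
        primary.append(ch.casefold())
        tertiary.append(1 if ch.isupper() else 0)
    return (primary, tertiary)

# No cutoff: both the space and the full stop are ignored.
print(sort_key("U.S.A.") == sort_key("USA"))              # True
# Cutoff at U+0020: the space is still ignored, the full stop is not.
print(sort_key("di Silva", 0x20) == sort_key("diSilva", 0x20))  # True
print(sort_key("U.S.A.", 0x20) < sort_key("USA", 0x20))         # True
```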

 The **Normalization** setting determines whether
 text is thoroughly normalized or not in comparison. Even if the setting is off
 (which is the default for many locales), text as represented in common usage
 will compare correctly (for details, see [UTN
 #5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in
 non-canonical order will there be a problem. If the setting is On, then the best
 results are guaranteed for all possible text input. There is a medium string
 comparison performance cost if this attribute is On, depending on the frequency
 of sequences that require normalization. There is no significant effect on sort
 key length. If the input text is known to be in NFD or NFKD normalization forms,
 there is no need to enable this Normalization option.

 **Example:**\
 **N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\
 **N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈
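
The two strings that differ only under N=X are canonically equivalent: their accent marks are merely stored in a different order. The following sketch uses Python's standard `unicodedata` module (not ICU) to show that NFD normalization reorders the marks and makes the strings identical. The variable names are illustrative only.

```python
import unicodedata

a_umlaut_dot = "\u00e4\u0323"  # ä followed by combining dot below
a_dot_umlaut = "\u1ea1\u0308"  # ạ followed by combining diaeresis

def nfd(text):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)

# Code point by code point, the two strings differ...
print(a_umlaut_dot == a_dot_umlaut)            # False
# ...but NFD puts the marks into canonical order, making them identical.
print(nfd(a_umlaut_dot) == nfd(a_dot_umlaut))  # True
print([hex(ord(c)) for c in nfd(a_umlaut_dot)])
# ['0x61', '0x323', '0x308']
```

The dot below (U+0323) has a lower canonical combining class than the diaeresis (U+0308), so it is placed first in both normalized strings.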

 Some **French** dictionary ordering traditions sort strings with
 different accents from the back of the string. This attribute is automatically
 set to On for the Canadian French locale (fr_CA). Users normally would not need
 to explicitly set this attribute. There is a string comparison performance cost
 when it is set On, but sort key length is unaffected.

 **Example:**\
 **F=X** cote < coté < côte < côté\
 **F=O** cote < côte < coté < côté
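
Backwards accent ordering can be sketched by reversing the secondary weights before comparison. This is a toy model rather than ICU's implementation: it assumes case-folded base letters as primary weights and uses the code points of the combining marks as secondary weights. `levels`, `sort_key`, and `backwards` are hypothetical names.

```python
import unicodedata

def levels(text):
    """Toy primary weights (base letters) and secondary weights (accents)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += (ord(ch),)  # accent belongs to the previous letter
        else:
            primary.append(ch.casefold())
            secondary.append(())
    return primary, secondary

def sort_key(text, backwards=False):
    primary, secondary = levels(text)
    if backwards:
        secondary = secondary[::-1]  # compare accents from the end of the string
    return (primary, secondary)

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=sort_key))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=lambda w: sort_key(w, backwards=True)))
# ['cote', 'côte', 'coté', 'côté']
```

With the reversed secondary weights, the accent on the final letter is examined first, so "côte" (no accent on the last letter) moves ahead of "coté".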

 Compatibility with JIS X 4061 requires the introduction of an
 additional level to distinguish **Hiragana** and Katakana characters. If
 compatibility with that standard is required, then this attribute is set On, and
 the strength should be set to at least Quaternary.

 This attribute is an implementation detail of the CLDR Japanese tailoring. The
 implementation might change to use a different mechanism to achieve the same
 Japanese sort order. Since ICU 50, this attribute is not settable any more.

 **Example:**\
 **H=X, S=4** きゅう = キュウ < きゆう = キユウ\
 **H=O, S=4** きゅう < キュウ < きゆう < キユウ

 > :point_right: **Note** If attributes in the collator name are not overridden,
 > it is assumed that they are the same as for the given locale.
 > For example, a collator opened with an empty
 > string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.

 ### Summary of Value Abbreviations

 Value         | Abbreviation
 ------------- | ------------
 Default       | D
 On            | O
 Off           | X
 Primary       | 1
 Secondary     | 2
 Tertiary      | 3
 Quaternary    | 4
 Identical     | I
 Shifted       | S
 Non-Ignorable | N
 Lower-First   | L
 Upper-First   | U