| --- |
| layout: default |
| title: Ignore Punctuation Options |
| nav_order: 8 |
| parent: Collation |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # “Ignore Punctuation” Options |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Overview |
| |
| By default, spaces and punctuation characters add primary (base character) |
| differences. Such characters sort less-than digits and letters. For example, the |
| default collation yields “De Anza” < “de-luge” < “deanza”. |
| |
| UCA/CLDR/ICU provide several options for “ignore punctuation” collation |
| settings, also known as Variable Weighting or Alternate Handling. These options |
| change the sorting behavior of “variable” characters algorithmically. “Variable” |
| characters are those with low (but non-zero) primary weights up to a threshold, |
| the “variable top”. By default, CLDR and ICU treat spaces and punctuation as |
| variable. (This can be changed via API.) The DUCET also includes most symbols. |
| |
| ## Non-Ignorable |
| |
| The default behavior in CLDR & ICU, shown above, is to not ignore punctuation |
| (alternate=non-ignorable) but to map variable characters to their normal primary |
| collation elements. |
| |
| All of the following options cause variable characters to be ignored on levels |
| 1..3. Only when strings compare equal up to the tertiary level may variable |
| characters make a difference, depending on the options. |
| |
| See also |
| |
| * [UCA: Variable |
| Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) |
| * [LDML: Setting |
| Options](https://htmlpreview.github.io/?https://github.com/unicode-org/cldr/blob/main/docs/ldml/tr35-collation.html#Setting_Options) |
| |
| Here is an overview of the sorting results with these options. |
| |
| Non-ignorable | Blanked | Shifted | Shift-Trimmed | Variable-After |
| ------------- | ------------ | ------- | ------------- | -------------- |
| delug | delug | delug | delug | delug |
| de-luge | de-luge | de-luge | *deluge* | *deluge* |
| delu-ge | delu-ge (*) | delu-ge | de-luge | deluge- |
| *deluge* | *deluge* (*) | *deluge* | delu-ge | delu-ge |
| Deluge | deluge- (*) | deluge- | deluge- | de-luge |
| deluge- | Deluge | Deluge | Deluge | Deluge |
| |
| Items with (*) compare equal to the preceding ones, and their relative order |
| is arbitrary. These only occur in the Blanked column. This table shows the |
| results of a stable sort algorithm with the non-ignorable column as input. |
| |
| ## Blanked |
| |
| The simplest option is to “ignore punctuation” completely, as if all variable |
| characters (and following combining marks) had been removed from the input |
| strings before comparing them. |
| |
| For example: “De Anza” = “De-Anza” = “DeAnza”. |
| |
| In ICU, this option is selected with alternate=shifted and |
| strength=primary|secondary|tertiary. (ICU does not support Blanked combined with |
| strength=identical.) |
| |
| The implementation “blanks” out all weights of the variable characters’ |
| collation elements. |
| |
| *With all of the following options, variable characters are ignored on levels |
| 1..3 but add distinctions on level 4 (quaternary level).* |
| |
| ## Shifted |
| |
| Among strings that compare tertiary-equal, that is, they contain the same |
| letters, accents and casing: |
| |
| * Sorts all variable characters less-than (before) regular characters. |
| * Appending a variable character makes a string sort *greater-than* the string |
| without it. |
| * *Inserting* a variable character makes a string sort *less-than* the string |
| without it. |
| * Inserting a variable character *earlier* in a string makes it sort |
| *less-than* inserting the variable character *later* in the string. |
| |
| The result is similar to [Merging Sort |
| Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys) (with shorter |
| prefixes sorting less-than longer ones), like in last-name+first-name sorting, |
| except only among tertiary-equal strings. |
| |
| For example: “de-luge” < “delu-ge” < “deluge” < “deluge-”. |
| |
| In ICU, this option is selected with alternate=shifted and |
| strength=quaternary|identical. |
| |
| The implementation “shifts” the primary weight p of the collation element \[p, |
| s, t, q\] of each variable characters down three levels: \[0, 0, 0, p\]. Regular |
| characters with primary collation elements get a high quaternary weight, higher |
| than that of any variable character. |
| |
| Note that this behavior is different from collation on secondary and tertiary |
| level, because normal collation elements get low secondary & tertiary weights |
| but high quaternary weights. Adding an accent difference anywhere makes a string |
| sort greater-than the string without it, and adding an accent difference earlier |
| makes it sort greater-than adding it later. For example, “deanza” < “deanzä” < |
| “deänza” < “dëanza”. (Compare the ‘ä’/‘ë’ positions here with the ‘-’ positions |
| above.) |
| |
| ## Shift-Trimmed |
| |
| *Note: This method is not currently implemented in ICU.* |
| |
| Among strings that compare tertiary-equal: |
| |
| * Sorts variable characters sometimes less-than, sometimes greater-than |
| regular characters. |
| * Inserting a variable character anywhere makes a string sort *greater-than* |
| the string without it. (The string without variable characters gets an empty |
| quaternary level.) |
| * Inserting a variable character *earlier* in a string makes it sort |
| *less-than* inserting the variable character *later* in the string. |
| |
| For example: “deluge” < “de-luge” < “delu-ge” < “deluge-”. |
| |
| The Shift-Trimmed method works like Shifted, except that *trailing* |
| high-quaternary weights (from regular characters) are removed (trimmed). |
| Compared with Shifted, the Shift-Trimmed method sorts strings without variable |
| characters before ones with variable characters added, rather than producing the |
| equivalent of [Merging Sort |
| Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys). |
| |
| Shift-Trimmed is more complicated to implement than all of the other options: |
| When comparing strings, a lookahead (or equivalent) is needed to determine |
| whether a non-variable character gets a zero quaternary weight (if no variables |
| follow) or a high quaternary weight (if at least one variable follows). When |
| building sort keys, trailing high/common quaternary weights are trimmed (backed |
| out) at the end of the quaternary level. |
| |
| ## Variable-After |
| |
| *Note: This method is not currently implemented in ICU.* |
| |
| Among strings that compare tertiary-equal: |
| |
| * Sorts all variable characters greater-than (after) regular characters. |
| * Inserting a variable character anywhere makes a string sort *greater-than* |
| the string without it. (Like Shift-Trimmed.) |
| * Inserting a variable character *earlier* in a string makes it sort |
| *greater-than* inserting the variable character *later* in the string. (Like |
| accent differences.) |
| |
| For example: “deluge” < “deluge-” < “delu-ge” < “de-luge”. |
| |
| The implementation “shifts” the primary weight p of the collation element \[p, |
| s, t, q\] of each variable characters down three levels: \[0, 0, 0, p\]. Regular |
| characters with primary collation elements get a *low* quaternary weight, |
| *lower* than that of any variable character. This is consistent with collation |
| on secondary and tertiary levels but unlike [Merging Sort |
| Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys). |
| |
| This method extends the [UCA well-formedness condition |
| 2](http://www.unicode.org/reports/tr10/#WF2) to apply to quaternary weights. |
| (UCA versions before UCA 6.2 did not limit WF2 to secondary & tertiary weights, |
| which meant that several of the Variable Weighting options technically created |
| ill-formed quaternary weights.) |