| <html> |
| |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> |
| <title>XML Collation Specification</title> |
| <style> |
| <!-- |
| th { background-color: #9999CC; border-style: solid; border-width: 1px; padding: 4 } |
| td { background-color: #CCCCFF; border-style: solid; border-width: 1px; padding: 4 } |
| table { border-style: solid; border-width: 1px } |
| --> |
| </style> |
| </head> |
| |
| <body style="margin:2em"> |
| |
| <h1 align="center">XML Collation Specification</h1> |
| <p align="center"><i><font size="4"><b><font color="#FF0000">Early Draft:</font></b> |
| MED 2002-06-21</font></i></p> |
| <p>This document defines an XML vocabulary for exchanging collation tailoring rules and |
| specifying comparison options. It allows any two implementations to exchange a |
| collation specification. Given the same specification, the two |
| implementations will achieve the same results when comparing strings.</p> |
| <p> The rules are defined by correspondence with the <i>basic</i> <a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">ICU |
| rule syntax</a> (used in ICU and Java) and/or the ICU parameterizations. You |
| should be familiar with the UCA and the ICU implementation of it before |
| continuing with the rest of this document.</p> |
| <blockquote> |
| <p><b>Note: </b>ICU provides a concise format for specifying orderings, based |
| on tailorings to the UCA. For example, to specify that 'k' and 'q' follow 'c', one |
| can use the rule: "& c < k < q". The rules also allow |
| default values to be set for general parameters, such as whether uppercase sorts |
| before lowercase.</p> |
| <p>Java contains an earlier version of ICU, and has not been updated recently. |
| It does not support any of the basic syntax marked with [...], and its default |
| table is not the UCA.</p> |
| <p>It is not necessary for ICU to be used in the underlying implementation. |
| The features are simply described here in terms of the ICU capabilities, since |
| that is easier than duplicating the text.</p> |
| </blockquote> |
| <p>Like the ICU rules, the tailoring syntax is designed to be independent of the |
| actual weights used in any particular UCA table. That way the same rules can be |
| applied to UCA versions over time, even if the underlying weights change.</p> |
| <h3><a name="Document_Structure">Document Structure</a></h3> |
| <p>The following describes the overall document structure used to specify a |
| collation in XML.</p> |
| <p><code><collation name="somename"><br> |
| <base .../><br> |
| <settings .../><br> |
| <rules><br> |
| <!-- rules go here, if there are any --><br> |
| </rules><br> |
| </collation></code></p> |
| <table border="1" width="100%"> |
| <tr> |
| <td width="100%"><b>TBD:</b> |
| <ul> |
| <li><b>Add DTD</b></li> |
| <li><b>Clarify how versions work.</b></li> |
| <li><b>Add Namespace</b></li> |
| </ul> |
| </td> |
| </tr> |
| </table> |
| <h3><a name="Base">Base</a></h3> |
| <p>There must be exactly one base element. The base element indicates the |
| collation ordering that is to be used as a foundation. This base collation |
| ordering can be modified (tailored) by a rules element, and the settings in the |
| base can be overridden by the settings element. The rules are treated as if they |
| were appended to the rules of the base. When xml:lang is used, the rules |
| in the ICU repository with that version are specified. The base element takes one of two alternative |
| attributes:</p> |
| <table> |
| <tr> |
| <th>Attribute</th> |
| <th>Options</th> |
| <th>XML Example</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <td>uca</td> |
| <td><i>uca version/unicode version</i></td> |
| <td>uca="3.1.1d1/3.2.0"</td> |
| <td>Specifies the UCA version</td> |
| </tr> |
| <tr> |
| <td>src</td> |
| <td><i>URL</i></td> |
| <td>src="http://www.foo.com/sort_en_us.xml"</td> |
| <td>Points to a different collation specification.</td> |
| </tr> |
| </table> |
| <p>The first one is used for a direct table, one that either uses the UCA alone, |
| or modifies it with settings and/or rules. The second one is used to refer to a |
| pre-existing document in this format, which can also be modified with settings |
| and/or rules.</p> |
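| <p>The constraints on the base element can be checked mechanically. The following is a minimal Python sketch, not part of this specification: the function name <code>read_base</code> is hypothetical, and it assumes that "two alternative attributes" means exactly one of uca or src must be present.</p> |

```python
import xml.etree.ElementTree as ET


def read_base(collation_xml):
    """Return a tuple describing the single base element of a collation document.

    Hypothetical helper: the name and the return shape are illustrative only.
    """
    root = ET.fromstring(collation_xml)
    bases = root.findall("base")
    if len(bases) != 1:
        raise ValueError("expected exactly one base element, found %d" % len(bases))
    base = bases[0]
    uca, src = base.get("uca"), base.get("src")
    # Assumption: the two attributes are mutually exclusive alternatives.
    if (uca is None) == (src is None):
        raise ValueError("base must carry exactly one of the uca or src attributes")
    if uca is not None:
        # The uca value has the form "uca version/unicode version".
        uca_version, unicode_version = uca.split("/")
        return ("uca", uca_version, unicode_version)
    return ("src", src)


doc = '<collation name="somename"><base uca="3.1.1d1/3.2.0"/></collation>'
print(read_base(doc))  # ('uca', '3.1.1d1', '3.2.0')
```

| <p>A real implementation would also have to fetch and parse the document named by src, which this sketch does not attempt.</p> |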
| <p><i>Example 1:<br> |
| The following specifies a German phonebook ordering, by setting the umlauted |
| letters to be equivalent to base + e.</i></p> |
| <blockquote> |
| <pre><collation name="German Phonebook Ordering"> |
| <base uca="3.1.1d1/3.2.0"/> |
| <rules> |
| <reset/> ae <t/> ä |
| <reset/> AE <t/> Ä |
| <reset/> oe <t/> ö |
| <reset/> OE <t/> Ö |
| <reset/> ue <t/> ü |
| <reset/> UE <t/> Ü |
| </rules> |
| </collation></pre> |
| </blockquote> |
| <p><i>Example 2:<br> |
| Supposing the above is on the web at <a href="http://www.foo.com/de_de_phonebook.xml">http://www.foo.com/de_de_phonebook.xml</a>, |
| the following modifies that to sort uppercase first, and sort the character '@' |
| as if it were spelled out.</i></p> |
| <blockquote> |
| <pre><collation name="German Phonebook Ordering, Uppercase First with At Sign"> |
| <base src="http://www.foo.com/de_de_phonebook.xml"/> |
| <settings caseFirst="upper"/> |
| <rules> |
| <reset/> Affenschwanz <t/> @ |
| </rules> |
| </collation></pre> |
| </blockquote> |
| <h3><a name="Setting_Options">Setting Options</a></h3> |
| <p>There must be exactly one settings element. It contains global settings for |
| the collation sequence. For example, <settings |
| strength="secondary"/> will only compare strings based on their |
| primary and secondary weights, ignoring any weaker weights.</p> |
| <p>The following table lists the valid attributes. If an |
| attribute is not present, the default for the base is used. The default for the |
| UCA is listed in italics below, but it may be modified by the base. The effect |
| of these attributes is defined by reference to the effect of the <a href="http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a69">setAttribute</a> |
| API (except for variableTop, which corresponds to the <a href="http://oss.software.ibm.com/icu/apiref/classCollator.html#a21">setVariableTop</a> |
| API). <i>[Ed. Note: This is temporary, until the textual description is brought |
| in here]. </i>The Basic Example column shows the equivalent option in the basic |
| syntax, where the setting can also be given with rules.</p> |
| <table> |
| <tbody> |
| <tr> |
| <th>Attribute</th> |
| <th>Options</th> |
| <th>Basic Example </th> |
| <th>XML Example</th> |
| </tr> |
| <tr> |
| <td>alternate</td> |
| <td><i>non-ignorable</i><br> |
| shifted</td> |
| <td><font color="#000000"><code>[alternate non-ignorable]</code></font></td> |
| <td><code>alternate="non-ignorable"</code></td> |
| </tr> |
| <tr> |
| <td>backwards</td> |
| <td>on<br> |
| <i>off</i></td> |
| <td><font color="#000000"><code>[backwards on] </code></font></td> |
| <td><code>backwards="on"</code></td> |
| </tr> |
| <tr> |
| <td>normalization</td> |
| <td>on<br> |
| off</td> |
| <td><font color="#000000"><code>[normalization on] </code></font></td> |
| <td><code>normalization="on"</code></td> |
| </tr> |
| <tr> |
| <td>caseLevel</td> |
| <td>on<br> |
| off</td> |
| <td><font color="#000000"><code>[caseLevel on]</code></font></td> |
| <td><code>caseLevel="on"</code></td> |
| </tr> |
| <tr> |
| <td>caseFirst</td> |
| <td>upper<br> |
| lower<br> |
| off</td> |
| <td><font color="#000000"><code>[caseFirst off]</code></font></td> |
| <td><code>caseFirst="off"</code></td> |
| </tr> |
| <tr> |
| <td>hiraganaQuaternary</td> |
| <td>on<br> |
| off</td> |
| <td><code>[hiraganaQ on]</code></td> |
| <td><code>hiraganaQuaternary="on"</code></td> |
| </tr> |
| <tr> |
| <td><font color="#000000">strength</font></td> |
| <td>primary (1)<br> |
| secondary (2)<br> |
| tertiary (3)<br> |
| quaternary (4)<br> |
| identical (5)</td> |
| <td><code>[strength 1]</code></td> |
| <td><code>strength="primary"</code></td> |
| </tr> |
| <tr> |
| <td>variableTop<sup>1</sup></td> |
| <td><font color="#000000">at character(s)<br> |
| before character(s)<br> |
| after character(s)</font></td> |
| <td><code>& x = [variable top]</code></td> |
| <td><code>variableTopAfter="x"</code></td> |
| </tr> |
| </tbody> |
| </table> |
| <blockquote> |
| <p><b>Issue:</b> This syntax might limit the characters in variableTop, since |
| attributes can't handle all characters. Perhaps this needs to be a separate |
| element.</p> |
| <ol> |
| <li>The default value for variableTop depends on the UCA setting. For |
| example, in 3.1.1d1, the value is:<br> |
| U+1D7C3 MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL. See |
| below for the layout.</li> |
| </ol> |
| </blockquote> |
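| <p>The defaulting rule for settings (an absent attribute takes the default of the base) can be sketched in Python as follows. This is illustrative only: the function names are hypothetical, <code>UCA_DEFAULTS</code> contains only the two defaults that the table above marks in italics, and only the attributes whose basic form is a plain <code>[name value]</code> option are rendered.</p> |

```python
# Only the defaults that the options table marks in italics; other UCA
# defaults are not stated there, so they are deliberately left out.
UCA_DEFAULTS = {"alternate": "non-ignorable", "backwards": "off"}

# Attributes whose basic-syntax form is simply "[name value]".
SIMPLE_ATTRIBUTES = ("alternate", "backwards", "normalization", "caseLevel", "caseFirst")


def effective_settings(base_defaults, settings_attrs):
    """Overlay the attributes of a settings element on the defaults of the base."""
    merged = dict(base_defaults)
    merged.update(settings_attrs)
    return merged


def to_basic_options(settings):
    """Render the simple attributes as basic-syntax options, in table order."""
    return ["[%s %s]" % (name, settings[name])
            for name in SIMPLE_ATTRIBUTES if name in settings]


merged = effective_settings(UCA_DEFAULTS, {"caseFirst": "upper", "backwards": "on"})
print(to_basic_options(merged))
# ['[alternate non-ignorable]', '[backwards on]', '[caseFirst upper]']
```

| <p>The strength, hiragana, and variableTop attributes are omitted because their basic-syntax forms are not a direct copy of the attribute name and value.</p> |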
| <h2><a name="Rules">Rules</a></h2> |
| <p>The rules section, if there is one, contains rules that tailor whatever was |
| in the base. The rule syntax, while valid XML, is somewhat unusual. The goal is |
| to have clearly expressed rules, with a concise format, that parallels the Basic |
| syntax as much as possible.</p> |
| <h3><a name="Orderings">Orderings</a></h3> |
| <p>The following are the normal orderings used for the bulk of characters.</p> |
| <table> |
| <tr> |
| <th>Basic Symbol</th> |
| <th>Basic Example</th> |
| <th>XML Symbol</th> |
| <th>XML Example</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <td align="center"><code>< </code></td> |
| <td><code>a < b </code></td> |
| <td><code><p/></code></td> |
| <td><code>a <p/> b</code></td> |
| <td>Make 'b' sort after 'a', as a <i>primary</i> (base-character) difference</td> |
| </tr> |
| <tr> |
| <td align="center"><code><< </code></td> |
| <td><code>a << ä </code></td> |
| <td><code><s/></code></td> |
| <td><code>a <s/> ä</code></td> |
| <td>Make 'ä' sort after 'a' as a <i>secondary</i> (accent) difference</td> |
| </tr> |
| <tr> |
| <td align="center"><code><<< </code></td> |
| <td><code>a <<< A </code></td> |
| <td><code><t/></code></td> |
| <td><code>a <t/> A</code></td> |
| <td>Make 'A' sort after 'a' as a <i>tertiary</i> (case) difference</td> |
| </tr> |
| <tr> |
| <td align="center"><code>= </code></td> |
| <td><code>v = w </code></td> |
| <td><code><eq/></code></td> |
| <td><code>v <eq/> w</code></td> |
| <td>Make 'w' sort exactly the same as 'v'</td> |
| </tr> |
| <tr> |
| <td align="center"><code>& </code></td> |
| <td><code>& Z </code></td> |
| <td><code><reset/></code></td> |
| <td><code><reset/> Z</code></td> |
| <td>Don't change the ordering of Z, but place subsequent characters relative |
| to it.</td> |
| </tr> |
| </table> |
| <p>Note that each character is placed relative to the characters <i>before</i> |
| it. Thus the following means "change the weight of W so that it comes after |
| Z, with a primary difference".</p> |
| <blockquote> |
| <pre><reset/> Z <p/> W</pre> |
| </blockquote> |
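| <p>Because the XML symbols correspond one-to-one with the basic symbols, a rules element can be translated back to the basic syntax mechanically. The following Python sketch shows one way to do so. The function name <code>rules_to_basic</code> is hypothetical, and the sketch ignores quoting, escaped code points, and attributes on reset.</p> |

```python
import xml.etree.ElementTree as ET

# XML rule elements and their basic-syntax symbols, as listed in this document.
SYMBOLS = {
    "reset": "&",
    "p": "<",
    "s": "<<",
    "t": "<<<",
    "eq": "=",
    "x": "/",
    "context": "|",
}


def rules_to_basic(rules_xml):
    """Translate a rules element into a basic-syntax rule string.

    Each empty element becomes its symbol; the text that follows the element
    (its "tail") is the character sequence the symbol applies to.
    """
    root = ET.fromstring(rules_xml)
    parts = []
    for child in root:
        if child.tag not in SYMBOLS:
            raise ValueError("unsupported rule element: " + child.tag)
        parts.append(SYMBOLS[child.tag])
        text = (child.tail or "").strip()
        if text:
            parts.append(text)
    return " ".join(parts)


print(rules_to_basic("<rules><reset/> Z <p/> W</rules>"))  # & Z < W
```

| <p>Whitespace around each character sequence is stripped, which matches the statement that insignificant spaces are dropped.</p> |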
| <h3><a name="Escaping_Characters">Escaping Characters</a></h3> |
| <p>XML cannot directly contain every Unicode code |
| point (U+0000, for example, is not a legal XML character). Extra syntax is therefore required to represent those code points |
| that cannot otherwise be represented. This corresponds to the quoting mechanism |
| used in the basic syntax. It must also be used where spaces are significant, |
| since they are otherwise stripped.</p> |
| <table> |
| <tr> |
| <th>Basic Example</th> |
| <th>XML Example</th> |
| </tr> |
| <tr> |
| <td><code>'\u0000'</code></td> |
| <td><code><cp hex="0"/></code></td> |
| </tr> |
| </table> |
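| <p>A consumer of this format has to turn each cp element back into the character it stands for. The following Python sketch shows the idea. The function name <code>expand_cp</code> and the wrapper element <code>fragment</code> are hypothetical and serve only to make the example parseable.</p> |

```python
import xml.etree.ElementTree as ET


def expand_cp(fragment_xml):
    """Return the text of a fragment with every cp element replaced by its character.

    Child elements other than cp are skipped; only their following text is kept.
    """
    root = ET.fromstring(fragment_xml)
    out = [root.text or ""]
    for child in root:
        if child.tag == "cp":
            # The hex attribute holds the code point as a hexadecimal number.
            out.append(chr(int(child.get("hex"), 16)))
        out.append(child.tail or "")
    return "".join(out)


# U+0020 SPACE written as an escaped code point, so that it is not stripped.
assert expand_cp('<fragment>a<cp hex="20"/>b</fragment>') == "a b"
# U+0000, which XML cannot contain literally.
assert expand_cp('<fragment><cp hex="0"/></fragment>') == "\x00"
```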
| <h3><a name="Contractions">Contractions</a></h3> |
| <p>To sort a sequence as a single item (contraction), just use the sequence, |
| e.g.</p> |
| <table> |
| <tr> |
| <th>Basic Example</th> |
| <th>XML Example</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <td><code>& k < ch</code></td> |
| <td><code><reset/> k <p/> ch</code></td> |
| <td>Make the sequence 'ch' sort after 'k', as a primary (base-character) |
| difference</td> |
| </tr> |
| </table> |
| <h3><a name="Expansions">Expansions</a></h3> |
| <p>There are two ways to handle expansions (where a character sorts as a |
| sequence) with both the basic syntax and the XML syntax. The first method is to |
| reset to the sequence of characters. The second is to use the extension |
| sequence. Both are equivalent in practice (unless the reset sequence happens to |
| be a contraction).</p> |
| <table> |
| <tr> |
| <th>Basic</th> |
| <th>XML</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <td><code>& ae < </code>ä</td> |
| <td><code><reset/> ae <p/> </code>ä</td> |
| <td>Make 'ä' sort after the sequence 'ae'; thus 'ä' will behave as if it |
| expands to a character after 'a' followed by an 'e' (unless 'ae' is |
| defined beforehand as a contraction).</td> |
| </tr> |
| <tr> |
| <td><code>& a < </code>ä<code> / e</code></td> |
| <td><code><reset/> a <p/> </code>ä<code> <x/> e</code></td> |
| <td>Make 'ä' sort after 'a', with 'e' as the extension; thus 'ä' will behave as if it |
| expands to a character after 'a' followed by an 'e'.</td> |
| </tr> |
| </table> |
| <p>In the basic syntax, you can reset variable top by treating it as if it were |
| a character. In XML, it is always an option on settings, as described above.</p> |
| <h3><a name="Context_Before">Context Before</a></h3> |
| <p>The context before a character can affect how it is ordered, such as in |
| Japanese. This could be expressed with a combination of contractions and |
| expansions, but is faster using a context. (The actual weights produced are |
| different, but the resulting string comparisons are the same.)</p> |
| <table> |
| <tr> |
| <th>Basic</th> |
| <th>XML</th> |
| </tr> |
| <tr> |
| <td><code>& ァ<br> |
| <<< ァ | ー<br> |
| = ァ | ー<br> |
| = ぁ | ー</code></td> |
| <td><code><reset/> ァ<br> |
| <t/> ァ <context/> ー<br> |
| <eq/> ァ <context/> ー<br> |
| <eq/> ぁ <context/> ー</code></td> |
| </tr> |
| </table> |
| <h3><a name="Placing_Characters_Before_Others">Placing Characters Before Others</a></h3> |
| <p>There are certain circumstances where characters need to be placed before a |
| given character, rather than after. This is the case with Pinyin, for example, |
| where certain accented letters are positioned before the base letter. That is |
| accomplished with the following syntax.</p> |
| <table> |
| <tbody> |
| <tr> |
| <th>Item</th> |
| <th>Options</th> |
| <th>Basic Example </th> |
| <th>XML Example</th> |
| </tr> |
| <tr> |
| <td>before </td> |
| <td>primary<br> |
| secondary<br> |
| tertiary<br> |
| identical</td> |
| <td><code>& [before 1] a<br> |
| << à</code></td> |
| <td><code><reset before="primary"/> a<br> |
| <s/> à</code></td> |
| </tr> |
| </tbody> |
| </table> |
| <h3><a name="Logical_Reset_Positions">Logical Reset Positions</a></h3> |
| <p>The UCA has the following structure for primary weights, going from low to |
| high.</p> |
| <table> |
| <tr> |
| <th valign="top" align="center" bgcolor="#CCCCFF">Items</th> |
| <th valign="top" align="center" bgcolor="#CCCCFF">Description</th> |
| <th valign="top" align="center" bgcolor="#CCCCFF">UCA Examples</th> |
| </tr> |
| <tr> |
| <td>first tertiary ignorable<br> |
| ...<br> |
| last tertiary ignorable</td> |
| <td>primary, secondary, tertiary weights = ignore</td> |
| <td>Control Codes<br> |
| Format Characters<br> |
| Hebrew Points<br> |
| Tibetan Signs<br> |
| ...</td> |
| </tr> |
| <tr> |
| <td>first secondary ignorable<br> |
| ...<br> |
| last secondary ignorable</td> |
| <td>primary, secondary weights = ignore</td> |
| <td>None in UCA</td> |
| </tr> |
| <tr> |
| <td>first primary ignorable<br> |
| ...<br> |
| last primary ignorable</td> |
| <td>primary weights = ignore</td> |
| <td>Most combining marks</td> |
| </tr> |
| <tr> |
| <td>first variable<br> |
| ...<br> |
| last variable</td> |
| <td>primary weights != ignore,<br> |
| <i> <b>if</b> alternate = non-ignorable<br> |
| </i><br> |
| primary, secondary, tertiary weights = ignore,<br> |
| <i><b>if</b> alternate = shifted</i></td> |
| <td>Whitespace,<br> |
| Punctuation,<br> |
| Symbols</td> |
| </tr> |
| <tr> |
| <td>first non-ignorable<br> |
| ...<br> |
| last non-ignorable</td> |
| <td>primary weights != ignore</td> |
| <td>Small number of exceptional symbols<br> |
| [e.g. U+02D0 MODIFIER LETTER TRIANGULAR COLON]<br> |
| Numbers<br> |
| Latin<br> |
| Greek<br> |
| ...</td> |
| </tr> |
| <tr> |
| <td><i>implicits</i></td> |
| <td>primary weights != ignore,<br> |
| <i>assigned automatically</i></td> |
| <td>CJK, CJK compatibility (that are not decomposed)<br> |
| CJK Extension A, B<br> |
| Unassigned</td> |
| </tr> |
| <tr> |
| <td>first trailing<br> |
| ...<br> |
| last trailing</td> |
| <td>primary weights != ignore,<br> |
| <i>used for trailing syllable components</i></td> |
| <td>Jamo Trailing<br> |
| Jamo Leading</td> |
| </tr> |
| </table> |
| <p>Each of the above values (except <i>implicits</i>) can be used with a reset |
| to position characters after (or before) that logical position. That allows |
| characters to be ordered before or after a logical position rather than a |
| specific character.</p> |
| <blockquote> |
| <p>The reason for this is so that tailorings can be more stable. A future |
| version of the UCA might add characters at any point in the above list. |
| Suppose that you set character X to be after Y. It could be that you want X to |
| come after Y, no matter what future characters are added; or it could be that |
| you just want X to come after a given logical position, e.g. after the last |
| primary ignorable.</p> |
| </blockquote> |
| <p>Here is an example of the syntax:</p> |
| <table> |
| <tr> |
| <th>Basic</th> |
| <th>XML</th> |
| </tr> |
| <tr> |
| <td><code>& [first tertiary ignorable]<br> |
| << à</code></td> |
| <td><code><reset/><position at="first tertiary |
| ignorable"/><br> |
| <s/> à</code></td> |
| </tr> |
| </table> |
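| <p>The two reset variants (a before attribute, and a position element naming a logical position) both map onto a bracketed option in the basic syntax. The following Python sketch illustrates that mapping. The function name <code>reset_to_basic</code> is hypothetical, and the numeric levels are assumed from the <code>[before 1]</code> / <code>before="primary"</code> pairing shown in this document; the identical option is left out because its basic form is not given here.</p> |

```python
import xml.etree.ElementTree as ET

# Assumed numbering, extrapolated from [before 1] <-> before="primary".
BEFORE_LEVELS = {"primary": 1, "secondary": 2, "tertiary": 3}


def reset_to_basic(rules_xml):
    """Translate the leading reset of a rules element into basic syntax."""
    children = list(ET.fromstring(rules_xml))
    if not children or children[0].tag != "reset":
        raise ValueError("fragment must start with a reset element")
    reset = children[0]
    parts = ["&"]
    before = reset.get("before")
    if before is not None:
        parts.append("[before %d]" % BEFORE_LEVELS[before])
    if len(children) > 1 and children[1].tag == "position":
        # Collapse any line breaks inside the attribute value to single spaces.
        parts.append("[%s]" % " ".join(children[1].get("at").split()))
    else:
        parts.append((reset.tail or "").strip())
    return " ".join(parts)


print(reset_to_basic('<rules><reset before="primary"/> a</rules>'))
# & [before 1] a
print(reset_to_basic('<rules><reset/><position at="first tertiary ignorable"/></rules>'))
# & [first tertiary ignorable]
```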
| <p>For example, to make a character be a secondary ignorable, one can make it be |
| immediately after (at a secondary level) a specific character (like a combining |
| dieresis), or one can make it be immediately after the last secondary ignorable.</p> |
| |
| </body> |
| |
| </html> |