| <html> |
| |
| <head> |
| <meta http-equiv="Content-Language" content="en-us"> |
| <meta name="GENERATOR" content="Microsoft FrontPage 5.0"> |
| <meta name="ProgId" content="FrontPage.Editor.Document"> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <title>IDN Characters</title> |
| <link rel="stylesheet" type="text/css" href="idn-chars.css"> |
| </head> |
| |
| <body> |
| |
| <h1>IDN Character Categorization</h1> |
| <p><i>%date%, MED</i></p> |
| <p>This page lists all Unicode characters relevant to IDN in a <a href="#Categorization">chart</a>, |
| broken down by category. Characters are grouped first by script, and then by subcategory.</p> |
| <p>The "output" IDN characters are ones that can result from nameprep, while the "input" characters |
| are those that are allowed in input, but transformed (remapped or deleted). Tool-tips provide the |
| character code and name (in enabled browsers). The following table describes the subcategories. |
| Within each subcategory characters are sorted according to the default |
| <a href="http://www.unicode.org/reports/tr10/">UCA</a> order.</p> |
| <blockquote> |
| <table border="1" cellpadding="2" cellspacing="0"> |
| <caption><b><font size="4">Key</font></b></caption> |
| <tr> |
| <th>Type</th> |
| <th>Subcategory</th> |
| <th>Description</th> |
| </tr> |
| <tr> |
| <th rowspan="5">Output</th> |
| <td class="Atomic"><a name="Atomic">Atomic</a></td> |
| <td>Characters that don't fall into any of the following subcategories</td> |
| </tr> |
| <tr> |
| <td class="Atomic-no-uppercase"><a name="Atomic-no-uppercase">Atomic-no-uppercase</a></td> |
| <td>For bicameral scripts, Atomic characters that have no uppercase form. These need to be examined |
| to see which are used in modern languages.</td> |
| </tr> |
| <tr> |
| <td class="Pattern_Syntax"><a name="Pattern_Syntax">Pattern_Syntax</a></td> |
| <td>Characters recommended as a basis for use in pattern syntax. Excludes the |
| <a href="#Word_Characters">additional word characters</a>.</td> |
| </tr> |
| <tr> |
| <td class="Non-XID"><a name="Non-XID">Non-XID</a></td> |
| <td>Characters not recommended as a basis for identifiers, excluding Pattern_Syntax and |
| <a href="#Word_Characters">additional word characters</a>.</td> |
| </tr> |
| <tr> |
| <td class="NFD-Decomposable"><a name="NFD-Decomposable">NFD-Decomposable</a></td> |
| <td>Characters with NFD (canonical) decompositions. These are broken out separately because |
| certain spoofing techniques are applied to them <i>via their decompositions.</i></td> |
| </tr> |
| <tr> |
| <th rowspan="4">Input</th> |
| <td class="IDN-Remapped-Case-Atomic"><a name="IDN-Remapped-Case-Atomic"> |
| IDN-Remapped-Case-Atomic</a></td> |
| <td>Atomic characters remapped by IDN due to case folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> |
| Section 3.2].</td> |
| </tr> |
| <tr> |
| <td class="IDN-Remapped-Case-NFD-Decomposable"><a name="IDN-Remapped-Case-NFD-Decomposable"> |
| IDN-Remapped-Case-NFD-Decomposable</a></td> |
| <td>Characters that are NFD (canonical) decomposable and that are remapped by IDN due to case |
| folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> Section 3.2].</td> |
| </tr> |
| <tr> |
| <td class="IDN-Remapped-Compat"><a name="IDN-Remapped-Compat">IDN-Remapped</a></td> |
| <td>Characters remapped by IDN due to compatibility (NFKD) mapping [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> |
| Section 4].</td> |
| </tr> |
| <tr> |
| <td class="IDN-Deleted"><a name="IDN-Deleted">IDN-Deleted</a></td> |
| <td>Characters deleted by IDN, that is, mapped to nothing [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> |
| Section 3.1]</td> |
| </tr> |
| <tr> |
| <th>Prohibited</th> |
| <td class="IDN-Prohibited"><a name="IDN-Prohibited">IDN-Prohibited </a></td> |
| <td>Characters prohibited in IDN [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> |
| Section 5] (Note: most of these are due to IDN's using an old version of Unicode. IDN does |
| treat unassigned characters differently than explicitly prohibited characters, but for our |
| purposes this distinction doesn't matter.)</td> |
| </tr> |
| </table> |
| </blockquote> |
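<p>The input, prohibited, and NFD-Decomposable rows of the key can be approximated with Python's
standard <code>stringprep</code> and <code>unicodedata</code> modules. This is only a minimal sketch,
not the tool used to build the chart: the function name <code>rough_subcategory</code>, the order of
the checks, and the catch-all label <code>"Output (other)"</code> are assumptions made for
illustration, and the sketch does not separate Atomic, Atomic-no-uppercase, Pattern_Syntax, or
Non-XID.</p>

```python
import stringprep
import unicodedata

# StringPrep (RFC 3454) is defined against Unicode 3.2. Python ships that
# snapshot as unicodedata.ucd_3_2_0, and the stringprep module relies on it.
ucd = unicodedata.ucd_3_2_0

# StringPrep Section 5 tables that Nameprep (RFC 3491) prohibits.
PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)


def rough_subcategory(ch):
    """Return an approximate key label for one character (illustrative only)."""
    # True when the character has a canonical (NFD) decomposition.
    decomposable = ucd.normalize("NFD", ch) != ch

    # StringPrep Section 3.1: characters mapped to nothing.
    if stringprep.in_table_b1(ch):
        return "IDN-Deleted"
    # StringPrep Section 3.2: case folding (table B.2).
    if stringprep.map_table_b2(ch) != ch:
        if decomposable:
            return "IDN-Remapped-Case-NFD-Decomposable"
        return "IDN-Remapped-Case-Atomic"
    # StringPrep Section 4: anything else that NFKC normalization changes.
    if ucd.normalize("NFKC", ch) != ch:
        return "IDN-Remapped-Compat"
    # Unassigned in Unicode 3.2 (table A.1) or explicitly prohibited.
    if stringprep.in_table_a1(ch) or any(table(ch) for table in PROHIBITED):
        return "IDN-Prohibited"
    if decomposable:
        return "NFD-Decomposable"
    # Assumption: every remaining output subcategory is lumped together here.
    return "Output (other)"
```

<p>Under these assumptions, <code>rough_subcategory("A")</code> gives
<code>"IDN-Remapped-Case-Atomic"</code> and <code>rough_subcategory("\u00ad")</code> (soft hyphen)
gives <code>"IDN-Deleted"</code>. Unassigned and explicitly prohibited characters share one label,
matching the note in the key that the distinction does not matter here.</p>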
| <p>Characters that are normally invisible are represented in the chart by their Unicode number, such as "U+FE00".</p> |
| <p>At the end of this document, there is an additional section that lists all <a href='#Visible_Combining_Marks_0'>visible non-spacing marks</a>. |
| These are sorted first by combining character class (modified), then by script, then by code point.</p> |
| <p>For comparison of Indic characters, see <a href='indic-trans.html'>indic-trans.html</a>.</p> |
| <h3>Additional <a name="Word_Characters">Word Characters</a></h3> |
| <p>This is a draft list of characters based on the Word_Break property, and the notes at the end of |
| <i>Section 4 Word Boundaries</i>, in <a href="http://www.unicode.org/reports/tr29/tr29-9.html#Word_Boundaries">UAX #29</a>. |
| While not currently a part of the |
| recommended characters for programming identifiers (XID_Continue), these characters have been |
| identified as being necessary for more "natural language" identifiers, since some words in some |
| modern languages could not be constructed without them. See also |
| <a href="http://www.unicode.org/reports/tr31/tr31-5.html">UAX #31: Identifier and Pattern Syntax</a>. |
| These characters are listed in the plain text file, as described below.</p> |
| <h2>Plain-Text Version</h2> |
| <p>The information in the categorization is also available in a plain-text file, at |
| <a href="idn-chars.txt">idn-chars.txt</a>. It can be viewed as is, or loaded into a spreadsheet for |
| sorting and filtering to view the data in different ways. The format is:</p> |
| <blockquote> |
| <p>code ; script ; subcategory # general-category (character) character-name</p> |
| </blockquote> |
| <p><i>Examples:</i></p> |
| <pre>0061 ; LATIN ; Atomic # ; L& (a) LATIN SMALL LETTER A |
| 026B ; LATIN ; Atomic-no-uppercase # L& (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE |
| 2015 ; COMMON ; Pattern_Syntax # Pd (―) HORIZONTAL BAR |
| 058A ; ARMENIAN ; Atomic-no-uppercase # ; Pd (֊) ARMENIAN HYPHEN |
| 20AC ; COMMON ; Non-XID # ; Sc (€) EURO SIGN</pre> |
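<p>For filtering outside a spreadsheet, lines in this format can be split with a short script. The
following is a minimal sketch based only on the format line and the examples above, not on the full
file: the names <code>Entry</code> and <code>parse_line</code> are hypothetical, the sketch accepts
both the <code># ;</code> and the plain <code>#</code> forms seen in the examples, and it assumes
that any line not matching the pattern (such as a section title) can be skipped.</p>

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    code: int               # code point, parsed from the hex field
    script: str
    subcategory: str
    general_category: str   # e.g. "L&", "Pd", "Sc"
    name: str


# code ; script ; subcategory # [;] general-category (character) character-name
ENTRY_RE = re.compile(
    r"^\s*(?P<code>[0-9A-Fa-f]{4,6})\s*;"
    r"\s*(?P<script>[^;#]+?)\s*;"
    r"\s*(?P<subcategory>[^;#]+?)\s*#"
    r"\s*(?:;\s*)?(?P<gc>\S+)"
    r"\s+\(.*?\)"            # the literal character; recoverable from the code
    r"\s+(?P<name>.+?)\s*$"
)


def parse_line(line: str) -> Optional[Entry]:
    """Parse one data line; return None for anything that is not a data line."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return Entry(
        code=int(m.group("code"), 16),
        script=m.group("script"),
        subcategory=m.group("subcategory"),
        general_category=m.group("gc"),
        name=m.group("name"),
    )
```

<p>Collecting the non-<code>None</code> results of <code>parse_line</code> over the file gives a list
that can be grouped by <code>script</code> or <code>subcategory</code>.</p>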
| <p>At the end of <a href="idn-chars.txt">idn-chars.txt</a> is a section called ADDITIONAL WORD |
| CHARACTERS, defined as described above. Below that is a section of FOR REVIEW characters, |
| sorted by Unicode general category (with an additional category, XX, for the odd characters |
| whose names include <span style="font-variant: small-caps">MUSICAL SYMBOL, DINGBAT, or RADICAL</span>). |
| That list needs review to identify characters that are needed for words in modern languages |
| and should therefore be moved up into the ADDITIONAL WORD CHARACTERS list. Each character in the FOR |
| REVIEW list is included because it either:</p> |
| <ol> |
| <li>would not otherwise count as part of an XID, or</li> |
| <li>is part of a bicameral script and has no uppercase form (e.g., the situation for U+026B |
| above)</li> |
| </ol> |
| <p>In either case there is prima facie reason for some level of scrutiny, if the goal is to be |
| initially conservative in repertoire.</p> |
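<p>The two review criteria can be approximated in a few lines of Python. This is a hedged sketch, not
the procedure used to build the FOR REVIEW list: the function name <code>review_reasons</code> is
hypothetical, general category Ll stands in for "letter of a bicameral script", and the results
reflect the Unicode version bundled with the running interpreter, so they may differ from the lists
in this document.</p>

```python
import unicodedata


def review_reasons(ch):
    """Return the reasons (possibly none) why a character might need review."""
    reasons = []
    # Criterion 1: the character cannot continue an identifier.
    # str.isidentifier() applies XID_Start / XID_Continue, so prefixing a
    # known start character tests XID_Continue for ch alone.
    if not ("a" + ch).isidentifier():
        reasons.append("not XID_Continue")
    # Criterion 2 (approximation): a lowercase letter whose uppercase
    # mapping is the character itself.
    if unicodedata.category(ch) == "Ll" and ch.upper() == ch:
        reasons.append("lowercase letter without uppercase")
    return reasons
```

<p>For example, <code>review_reasons("a")</code> returns an empty list, while the euro sign is
reported as <code>["not XID_Continue"]</code>.</p>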
| <h2><a name="Categorization">Categorization</a></h2> |