<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>IDN Characters</title>
<link rel="stylesheet" type="text/css" href="idn-chars.css">
</head>
<body>
<h1>IDN Character Categorization</h1>
<p><i>%date%, MED</i></p>
<p>This page lists all Unicode characters relevant to IDN in a <a href="#Categorization">chart</a>,
broken down by category. Characters are grouped first by script, and then by subcategory.</p>
<p>The &quot;output&quot; IDN characters are ones that can result from nameprep, while the &quot;input&quot; characters
are those that are allowed in input, but transformed (remapped or deleted). Tool-tips provide the
character code and name (in browsers that support them). The following table describes the subcategories.
Within each subcategory characters are sorted according to the default
<a href="http://www.unicode.org/reports/tr10/">UCA</a> order.</p>
<blockquote>
<table border="1" cellpadding="2" cellspacing="0">
<caption><b><font size="4">Key</font></b></caption>
<tr>
<th>Type</th>
<th>Subcategory</th>
<th>Description</th>
</tr>
<tr>
<th rowspan="5">Output</th>
<td class="Atomic"><a name="Atomic">Atomic</a></td>
<td>Characters that don&#39;t fall into any of the following subcategories</td>
</tr>
<tr>
<td class="Atomic-no-uppercase"><a name="Atomic-no-uppercase">Atomic-no-uppercase</a></td>
<td>For bicameral scripts, Atomic characters that have no uppercase form. These need to be examined
to see which are used in modern languages.</td>
</tr>
<tr>
<td class="Pattern_Syntax"><a name="Pattern_Syntax">Pattern_Syntax</a></td>
<td>Characters recommended as a basis for use in pattern syntax. Excludes the
<a href="#Word_Characters">additional word characters</a>.</td>
</tr>
<tr>
<td class="Non-XID"><a name="Non-XID">Non-XID</a></td>
<td>Characters not recommended as a basis for identifiers, excluding Pattern_Syntax and
<a href="#Word_Characters">additional word characters</a>.</td>
</tr>
<tr>
<td class="NFD-Decomposable"><a name="NFD-Decomposable">NFD-Decomposable</a></td>
<td>Characters with NFD (canonical) decompositions. These are broken out separately because
certain spoofing techniques are applied to them <i>via their decompositions.</i></td>
</tr>
<tr>
<th rowspan="4">Input</th>
<td class="IDN-Remapped-Case-Atomic"><a name="IDN-Remapped-Case-Atomic">
IDN-Remapped-Case-Atomic</a></td>
<td>Atomic characters remapped by IDN due to case folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 3.2].</td>
</tr>
<tr>
<td class="IDN-Remapped-Case-NFD-Decomposable"><a name="IDN-Remapped-Case-NFD-Decomposable">
IDN-Remapped-Case-NFD-Decomposable</a></td>
<td>Characters that are NFD (canonical) decomposable and that are remapped by IDN due to case
folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> Section 3.2].</td>
</tr>
<tr>
<td class="IDN-Remapped-Compat"><a name="IDN-Remapped-Compat">IDN-Remapped</a></td>
<td>Characters remapped by IDN due to compatibility (NFKD) mapping. [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 4]</td>
</tr>
<tr>
<td class="IDN-Deleted"><a name="IDN-Deleted">IDN-Deleted</a></td>
<td>Characters deleted by IDN, that is, mapped to nothing [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 3.1]</td>
</tr>
<tr>
<th>Prohibited</th>
<td class="IDN-Prohibited"><a name="IDN-Prohibited">IDN-Prohibited </a></td>
<td>Characters prohibited in IDN [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
Section 5] (Note: most of these are due to IDN&#39;s using an old version of Unicode. IDN does
treat unassigned characters differently than explicitly prohibited characters, but for our
purposes this distinction doesn&#39;t matter.)</td>
</tr>
</table>
</blockquote>
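<p>As a rough illustration of how a few of these subcategories can be tested, the sketch below
uses Python&#39;s standard <code>unicodedata</code> module. The function names are hypothetical, and
the checks only approximate the StringPrep tables: they follow whatever Unicode version the Python
interpreter ships, not the older version that IDN is based on, so results may differ from the chart
for some characters.</p>
<pre>
```python
import unicodedata


def has_nfd_decomposition(ch):
    """True if NFD normalization changes ch (canonical decomposition)."""
    return unicodedata.normalize("NFD", ch) != ch


def is_case_remapped(ch):
    """Approximation of 'remapped due to case folding' using str.casefold()."""
    return ch.casefold() != ch


def is_compat_remapped(ch):
    """True if the compatibility (NFKD) form differs from the canonical (NFD) form."""
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)
```
</pre>
<p>These predicates are only building blocks. They do not reproduce the script grouping, the
Pattern_Syntax and Non-XID tests, or the prohibited and deleted character lists.</p>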
<p>Characters that are normally invisible are represented in the chart by their Unicode number, such as "U+FE00".</p>
<p>At the end of this document, there is an additional section that lists all <a href='#Visible_Combining_Marks_0'>visible non-spacing marks</a>.
These are sorted first by combining character class (modified), then by script, then by code point.</p>
<p>For comparison of Indic characters, see <a href='indic-trans.html'>indic-trans.html</a>.</p>
<h3>Additional <a name="Word_Characters">Word Characters</a></h3>
<p>This is a draft list of characters based on <i>Section 4 Word Boundaries</i> of
<a href="http://www.unicode.org/reports/tr29/tr29-9.html#Word_Boundaries">UAX #29</a>, in the
Word_Break property and notes at the end of the section. While not currently a part of the
recommended characters for programming identifiers (XID_Continue), these characters have been
identified as being necessary for more &quot;natural language&quot; identifiers, since some words in some
modern languages could not be constructed without them. See also
<a href="http://www.unicode.org/reports/tr31/tr31-5.html">UAX #31: Identifier and Pattern Syntax</a>.
These characters are listed in the plain text file, as described below.</p>
<h2>Plain-Text Version</h2>
<p>The information in the categorization is also available in a plain-text file, at
<a href="idn-chars.txt">idn-chars.txt</a>. It can be viewed as is, or loaded into a spreadsheet for
sorting and filtering to view the data in different ways. The format is:</p>
<blockquote>
<p>code ; script ; subcategory # general-category (character) character-name</p>
</blockquote>
<p><i>Examples:</i></p>
<pre>0061 ; LATIN ; Atomic # L&amp; (a) LATIN SMALL LETTER A
026B ; LATIN ; Atomic-no-uppercase # L&amp; (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE
2015 ; COMMON ; Pattern_Syntax # Pd (―) HORIZONTAL BAR
058A ; ARMENIAN ; Atomic-no-uppercase # Pd (֊) ARMENIAN HYPHEN
20AC ; COMMON ; Non-XID # Sc (€) EURO SIGN</pre>
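<p>For scripted sorting and filtering, one data line in the format above can be split as in the
following minimal sketch. The record type and function names are hypothetical, and the sketch
assumes that everything after the first <code>#</code> is a free-form comment and that lines
without three semicolon-separated fields (blank lines, section headings) can be skipped.</p>
<pre>
```python
from typing import NamedTuple


class IdnCharRecord(NamedTuple):
    code_point: int
    script: str
    subcategory: str
    comment: str


def parse_line(line):
    """Parse 'code ; script ; subcategory # comment'.

    Returns None for lines that do not carry exactly three data fields
    or whose first field is not a hexadecimal code point.
    """
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    if len(fields) != 3:
        return None
    try:
        code_point = int(fields[0], 16)
    except ValueError:
        return None
    return IdnCharRecord(code_point, fields[1], fields[2], comment.strip())


record = parse_line("2015 ; COMMON ; Pattern_Syntax # Pd (\u2015) HORIZONTAL BAR")
print(record.script, record.subcategory, hex(record.code_point))
```
</pre>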
<p>At the end of <a href="idn-chars.txt">idn-chars.txt</a> is a section called ADDITIONAL WORD
CHARACTERS, defined as described above. Below that is a section of FOR REVIEW characters,
sorted by Unicode general category. (An additional category of XX is added for the odd characters
whose names include <span style="font-variant: small-caps">MUSICAL SYMBOL, DINGBAT, or RADICAL</span>.)
That list needs review to identify characters that are needed for words in modern languages
and that should therefore be moved up into the ADDITIONAL WORD CHARACTERS list. Each character in the FOR
REVIEW list is included because it either:</p>
<ol>
<li>would not otherwise count as part of an XID, or</li>
<li>is part of a bicameral script and doesn&#39;t have an uppercase (e.g., the situation for U+026B
above)</li>
</ol>
<p>In either case there is a prima facie reason for some level of scrutiny, if the goal is to be
initially conservative in repertoire.</p>
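<p>The two criteria can be approximated with the Python standard library, as sketched below.
The function names are hypothetical. The sketch assumes that <code>str.isidentifier()</code> is an
acceptable stand-in for XID_Continue and that a lowercase letter left unchanged by
<code>str.upper()</code> has no uppercase. Both depend on the Unicode version of the Python
interpreter, so a character listed here as lacking an uppercase may have gained one in a later
Unicode version.</p>
<pre>
```python
import unicodedata


def is_xid_continue(ch):
    """Approximate XID_Continue: can ch follow a letter in a Python identifier?"""
    return ("a" + ch).isidentifier()


def lacks_uppercase(ch):
    """True for a lowercase letter that str.upper() leaves unchanged."""
    return unicodedata.category(ch) == "Ll" and ch.upper() == ch


def needs_review(ch):
    """Apply the two FOR REVIEW criteria to a single character."""
    return (not is_xid_continue(ch)) or lacks_uppercase(ch)
```
</pre>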
<h2><a name="Categorization">Categorization</a></h2>