
 <html>

 <head>
 <meta http-equiv="Content-Language" content="en-us">
 <meta name="GENERATOR" content="Microsoft FrontPage 5.0">
 <meta name="ProgId" content="FrontPage.Editor.Document">
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 <title>IDN Characters</title>
 <link rel="stylesheet" type="text/css" href="idn-chars.css">
 </head>

 <body>

 <h1>IDN Character Categorization</h1>
 <p><i>%date%, MED</i></p>
 <p>This page lists all Unicode characters relevant to IDN in a <a href="#Categorization">chart</a>,
 broken down by category. Characters are grouped first by script, and then by subcategory.</p>
 <p>The &quot;output&quot; IDN characters are ones that can result from nameprep, while the &quot;input&quot; characters
 are those that are allowed in input, but transformed (remapped or deleted). Tool-tips provide the
 character code and name (in enabled browsers). The following table describes the subcategories.
 Within each subcategory characters are sorted according to the default
 <a href="http://www.unicode.org/reports/tr10/">UCA</a> order.</p>
 <blockquote>
   <table border="1" cellpadding="2" cellspacing="0">
     <caption><b><font size="4">Key</font></b></caption>
     <tr>
       <th>Type</th>
       <th>Subcategory</th>
       <th>Description</th>
     </tr>
     <tr>
       <th rowspan="5">Output</th>
       <td class="Atomic"><a name="Atomic">Atomic</a></td>
       <td>Characters that don&#39;t fall into any of the following subcategories</td>
     </tr>
     <tr>
       <td class="Atomic-no-uppercase"><a name="Atomic-no-uppercase">Atomic-no-uppercase</a></td>
       <td>For bicameral scripts, Atomic characters without an uppercase. These need to be examined
       to see which are used in modern languages.</td>
     </tr>
     <tr>
       <td class="Pattern_Syntax"><a name="Pattern_Syntax">Pattern_Syntax</a></td>
       <td>Characters recommended as a basis for use in pattern syntax. Excludes the
       <a href="#Word_Characters">additional word characters</a>.</td>
     </tr>
     <tr>
       <td class="Non-XID"><a name="Non-XID">Non-XID</a></td>
       <td>Characters not recommended as a basis for identifiers, excluding Pattern_Syntax and
       <a href="#Word_Characters">additional word characters</a>.</td>
     </tr>
     <tr>
       <td class="NFD-Decomposable"><a name="NFD-Decomposable">NFD-Decomposable</a></td>
       <td>Characters with NFD (canonical) decompositions. These are broken out separately because
       certain spoofing techniques are applied to them <i>via their decompositions.</i></td>
     </tr>
     <tr>
       <th rowspan="4">Input</th>
       <td class="IDN-Remapped-Case-Atomic"><a name="IDN-Remapped-Case-Atomic">
       IDN-Remapped-Case-Atomic</a></td>
       <td>Atomic characters remapped by IDN due to case folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
       Section 3.2].</td>
     </tr>
     <tr>
       <td class="IDN-Remapped-Case-NFD-Decomposable"><a name="IDN-Remapped-Case-NFD-Decomposable">
       IDN-Remapped-Case-NFD-Decomposable</a></td>
       <td>Characters that are NFD (canonical) decomposable and that are remapped by IDN due to case
       folding [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a> Section 3.2].</td>
     </tr>
     <tr>
       <td class="IDN-Remapped-Compat"><a name="IDN-Remapped-Compat">IDN-Remapped-Compat</a></td>
       <td>Characters remapped by IDN due to compatibility (NFKD) mapping. [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
       Section 4]</td>
     </tr>
     <tr>
       <td class="IDN-Deleted"><a name="IDN-Deleted">IDN-Deleted</a></td>
       <td>Characters deleted by IDN, that is, mapped to nothing [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
       Section 3.1]</td>
     </tr>
     <tr>
       <th>Prohibited</th>
       <td class="IDN-Prohibited"><a name="IDN-Prohibited">IDN-Prohibited </a></td>
       <td>Characters prohibited in IDN [<a href="http://ietf.org/rfc/rfc3454.txt">StringPrep</a>
       Section 5] (Note: most of these are due to IDN&#39;s using an old version of Unicode. IDN does
       treat unassigned characters differently than explicitly prohibited characters, but for our
       purposes this distinction doesn&#39;t matter.)</td>
     </tr>
   </table>
 </blockquote>
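 <p>As a rough illustration of how the input and prohibited rows of the key relate to the StringPrep
 tables, the following sketch uses Python's standard <code>stringprep</code> and <code>unicodedata</code>
 modules. The function name <code>classify</code> and the order of the checks are assumptions made for
 this example, not the tool that generated the chart. It does not model the Atomic, Atomic-no-uppercase,
 Pattern_Syntax, or Non-XID distinctions, and it uses an NFKC comparison as a stand-in for the
 compatibility mapping, so a few canonical singletons are reported as compatibility remappings.</p>

```python
import stringprep
from unicodedata import ucd_3_2_0 as ucd  # Unicode 3.2 data, the version StringPrep is based on

# Prohibited-output tables checked here (assumed to mirror the nameprep profile).
_PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)


def classify(ch):
    """Return an approximate key label for a single character (a sketch only)."""
    if stringprep.in_table_b1(ch):
        # Table B.1: characters mapped to nothing.
        return "IDN-Deleted"
    if stringprep.in_table_a1(ch) or any(test(ch) for test in _PROHIBITED):
        # Unassigned code points are lumped in with prohibited ones,
        # as the key's note says the distinction does not matter here.
        return "IDN-Prohibited"
    decomposable = ucd.normalize("NFD", ch) != ch
    if stringprep.map_table_b2(ch) != ch:
        # Table B.2: case folding for use with NFKC.
        if decomposable:
            return "IDN-Remapped-Case-NFD-Decomposable"
        return "IDN-Remapped-Case-Atomic"
    if ucd.normalize("NFKC", ch) != ch:
        return "IDN-Remapped-Compat"
    if decomposable:
        return "NFD-Decomposable"
    # Finer output subcategories are not modeled in this sketch.
    return "Output (other)"


if __name__ == "__main__":
    for sample in ("a", "A", "\u00c0", "\u00e9", "\uff41", "\u00ad", "\ue000"):
        print("U+%04X" % ord(sample), classify(sample))
```

 <p>Checking deletion and prohibition before the remapping tests is a simplification chosen to keep the
 example short; a full nameprep implementation maps and normalizes first and checks prohibited
 output afterwards.</p>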
 <p>Characters that are normally invisible are represented in the chart by their Unicode number, such as "U+FE00".</p>
 <p>At the end of this document, there is an additional section that lists all <a href='#Visible_Combining_Marks_0'>visible non-spacing marks</a>.
 These are sorted first by combining character class (modified), then by script, then by code point.</p>
 <p>For comparison of Indic characters, see <a href='indic-trans.html'>indic-trans.html</a>.</p>
 <h3>Additional <a name="Word_Characters">Word Characters</a></h3>
 <p>This is a draft list of characters based on <i>Section 4 Word Boundaries</i> of
 <a href="http://www.unicode.org/reports/tr29/tr29-9.html#Word_Boundaries">UAX #29</a>, in the
 Word_Break property and notes at the end of the section. While not currently a part of the
 recommended characters for programming identifiers (XID_Continue), these characters have been
 identified as being necessary for more &quot;natural language&quot; identifiers, since some words in some
 modern languages could not be constructed without them. See also
 <a href="http://www.unicode.org/reports/tr31/tr31-5.html">UAX #31: Identifier and Pattern Syntax</a>.
 These characters are listed in the plain text file, as described below.</p>
 <h2>Plain-Text Version</h2>
 <p>The information in the categorization is also available in a plain-text file, at
 <a href="idn-chars.txt">idn-chars.txt</a>. It can be viewed as is, or loaded into a spreadsheet for
 sorting and filtering to view the data in different ways. The format is:</p>
 <blockquote>
   <p>code ; script ; subcategory # general-category (character) character-name</p>
 </blockquote>
 <p><i>Examples:</i></p>
 <pre>0061          ; LATIN ; Atomic # L&amp; (a) LATIN SMALL LETTER A
 026B          ; LATIN ; Atomic-no-uppercase # L&amp; (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE
 2015          ; COMMON ; Pattern_Syntax # Pd (―) HORIZONTAL BAR
 058A          ; ARMENIAN ; Atomic-no-uppercase # Pd (֊) ARMENIAN HYPHEN
 20AC          ; COMMON ; Non-XID # Sc (€) EURO SIGN</pre>
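 <p>For readers who prefer a script to a spreadsheet, a line in this format could be split into fields
 roughly as follows. This is a minimal sketch in Python: the names <code>Entry</code> and
 <code>parse_line</code> are hypothetical, and the parser assumes that every data line has exactly three
 semicolon-separated fields before the <code>#</code>. It tolerates a stray <code>;</code> directly after
 the <code>#</code> and returns <code>None</code> for anything that does not look like a data line,
 such as section headings.</p>

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    code: int               # code point as an integer
    script: str
    subcategory: str
    general_category: str
    name: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse 'code ; script ; subcategory # general-category (character) character-name'."""
    data, sep, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    if not sep or len(fields) != 3:
        return None  # blank line, heading, or pure comment
    code_text, script, subcategory = fields
    try:
        code = int(code_text, 16)
    except ValueError:
        return None
    # Drop an optional leading ';' after the '#', then split off the general category.
    comment = comment.strip().lstrip(";").strip()
    general_category, _, rest = comment.partition(" ")
    # The name follows the parenthesized character; the character itself is
    # recoverable from the code point, so it is not stored separately.
    name = rest.rpartition(") ")[2].strip()
    return Entry(code, script, subcategory, general_category, name)


if __name__ == "__main__":
    sample = "2015          ; COMMON ; Pattern_Syntax # Pd (\u2015) HORIZONTAL BAR"
    print(parse_line(sample))
```

 <p>Filtering a list of such entries by <code>subcategory</code> or <code>script</code> then gives the
 same views that the spreadsheet approach provides.</p>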
 <p>At the end of <a href="idn-chars.txt">idn-chars.txt</a> is a section called ADDITIONAL WORD
 CHARACTERS, defined as described above. Below that is a section of FOR REVIEW characters,
 sorted by Unicode general category (an additional category of XX is added for the odd characters
 whose names include <span style="font-variant: small-caps">MUSICAL SYMBOL, DINGBAT, or RADICAL</span>).
 We need review of that list to check for characters that are needed for words in modern languages,
 that is, that should be moved up into the ADDITIONAL WORD CHARACTERS list. Each character in the FOR
 REVIEW list is collected because it either: </p>
 <ol>
   <li>would not otherwise count as part of an XID, or</li>
   <li>is part of a bicameral script and doesn&#39;t have an uppercase (e.g., the situation for U+026B
   above)</li>
 </ol>
 <p>In either case there is prima facie reason for some level of scrutiny, if the goal is to be
 initially conservative in repertoire.</p>
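 <p>The two FOR REVIEW criteria can be approximated with Python's built-in Unicode support, as in the
 sketch below. The function names are hypothetical, and two approximations are assumed:
 <code>str.isidentifier()</code> stands in for the XID test, and "lowercase letter whose uppercase
 mapping is itself" stands in for "bicameral script without an uppercase", since the standard library
 has no script property. Results depend on the Unicode version bundled with the interpreter; for
 instance, U+026B has gained an uppercase counterpart in later Unicode versions, so the example uses
 U+0138 instead.</p>

```python
import unicodedata


def is_xid_continue(ch):
    # str.isidentifier() checks XID_Start on the first character and
    # XID_Continue on the rest, so prefix a known-good start character.
    return ("a" + ch).isidentifier()


def lacks_uppercase(ch):
    # A lowercase letter (general category Ll) whose uppercase mapping is itself.
    return unicodedata.category(ch) == "Ll" and ch.upper() == ch


def needs_review(ch):
    """Approximate the two criteria: not usable in an XID, or no uppercase."""
    return not is_xid_continue(ch) or lacks_uppercase(ch)


if __name__ == "__main__":
    for sample in ("a", "\u0138", "\u20ac", "\u058a"):
        print("U+%04X" % ord(sample), needs_review(sample))
```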
 <h2><a name="Categorization">Categorization</a></h2>