
 <html>

 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
 <meta name="GENERATOR" content="Microsoft FrontPage 4.0">
 <meta name="ProgId" content="FrontPage.Editor.Document">
 <title>Unicode Character Database</title>
 <style>
 <!--
 table        { padding: 4 }
 td           { padding: 4 }
 -->
 </style>
 </head>

 <body>

 <span class="cb" id style="DISPLAY: block">
 <h1 align="center">Unicode Character Database (UCD) in XML Format</h1>
 <h1 align="center"><b><font color="#FF0000">WARNING: FORMAT IS DRAFT!</font></b></h1>
 <p align="center">MD 2000.10.16</p>
 <table border="1" width="40%" align="right" cellspacing="4" cellpadding="0">
   <tr>
     <td width="100%" bgcolor="#C0C0C0"><span class="cb" id
       style="DISPLAY: block">
       <h4 align="center">Using Internet Explorer</h4>
       <p>The UCD-Main.xml file can be read in Internet Explorer (5.0 and above).
       However:</p>
       <ul>
         <li>It may take a few minutes to load completely.</li>
         <li>The XML parser in IE does not appear to be conformant: it seems to
           break on</span> the following valid code points (and others):
         <ul>
           <li>&lt;IEbugs<br>
             c1='&amp;#xFFF9;'<br>
             c2='&amp;#xFFFA;'<br>
             c3='&amp;#xFFFB;'<br>
             c4='&amp;#xFFFC;'<br>
             c5='&amp;#xFFFD;'<br>
             c6='&amp;#xF0000;'<br>
             c7='&amp;#xFFFFD;'<br>
             c8='&amp;#x100000;'<br>
             c9='&amp;#x10FFFD;'/&gt;</li>
         </ul>
       </li>
       </ul>
     </td>
   </tr>
 </table>
 <p><a href="UCD-Main.xml">UCD-Main.xml</a> provides an XML format for the main
 files in the Unicode Character Database. These include:</p>
 <ul>
   <li><code>UnicodeData.txt</code></li>
   <li><code>ArabicShaping.txt</code></li>
   <li><code>Jamo.txt</code></li>
   <li><code>SpecialCasing.txt</code></li>
   <li><code>CompositionExclusions.txt</code></li>
   <li><code>EastAsianWidth.txt</code></li>
   <li><code>LineBreak.txt</code></li>
   <li><code>BidiMirroring.txt</code></li>
   <li><code>CaseFolding.txt</code></li>
   <li><code>Blocks.txt</code></li>
   <li><code>PropList.alpha.txt</code></li>
 </ul>
 <p>Other files in the UCD have very different structure or purpose, and are best
 expressed with separate files. Some annotational data, such as that in
 NamesList.txt or the 10646 comment in UnicodeData, is also best served with
 separate files. The current UCD files not yet in XML format are:</p>
 <ul>
   <li><code>Unihan.txt</code></li>
   <li><code>NamesList.txt</code></li>
   <li><code>Index.txt</code></li>
   <li><code>NormalizationTest.txt</code></li>
 </ul>
 <h3>Format</h3>
 <p>The Unicode blocks are provided as a list of &lt;block .../&gt; elements,
 with attributes providing the start, end, and name.</p>
 <p>Each assigned code point is a &lt;e .../&gt; element, with attributes
 supplying specific properties. The meaning of the attributes is specified below.
 There is one exception: large ranges of code points for characters such as
 Hangul Syllables are abbreviated by indicating the start and end of the range.</p>
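 <p>As a rough illustration of the structure described above, the following
 Python sketch reads &lt;block&gt; and &lt;e&gt; elements with the standard
 library. The sample document is hypothetical: the root element name
 (<code>ucd</code>), the block attribute names (<code>start</code>,
 <code>end</code>, <code>name</code>) and the sample values are assumptions
 made for the example, not taken from UCD-Main.xml.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Only the element names "block" and "e" and the
# abbreviations c, n, gc, lc, uc come from these notes; the root element,
# the block attribute names and all values are assumed for illustration.
SAMPLE = """<ucd>
  <block start="0041" end="005A" name="Example Block"/>
  <e c="&#x0041;" n="LATIN CAPITAL LETTER A" gc="Lu" lc="&#x0061;"/>
  <e c="&#x0061;" n="LATIN SMALL LETTER A" gc="Ll" uc="&#x0041;"/>
</ucd>"""


def load(xml_text):
    """Return (blocks, chars): a list of block attribute dicts and a
    dict mapping each code point (as a one-character string) to its
    attribute dict."""
    root = ET.fromstring(xml_text)
    blocks = [dict(b.attrib) for b in root.iter("block")]
    # The XML parser resolves the numeric character references, so the
    # "c" attribute arrives as the character itself.
    chars = {e.get("c"): dict(e.attrib) for e in root.iter("e")}
    return blocks, chars


blocks, chars = load(SAMPLE)
print(blocks[0]["name"])
print(chars["A"]["gc"])
```

 <p>Attributes that are absent from an element are simply missing from its
 dictionary here; the defaults are described under Attribute Abbreviations.</p>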
 <p>Because of the volume of data, the attribute names are abbreviated. A <a
 href="#AttributeAbbreviations">key</a> explains the abbreviations, and relates
 them to the fields and values of the original UCD semicolon-delimited files.
 With few exceptions, the values in the XML are directly copied from data in the
 original UCD semicolon-delimited files. Those exceptions are described <a
 href="http://www.unicode.org/Public/3.0-Update1/UnicodeCharacterDatabase-3.0.1.html#DataModifications">below</a>.</p>
 <p>Numeric character references (NCRs) are used to encode the Unicode code
 points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see
 <a href="http://www.w3.org/TR/REC-xml#charsets">http://www.w3.org/TR/REC-xml#charsets</a>),
 or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are
 represented by '#xX;', where X is a hex number.</p>
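 <p>A reader of the data has to turn the '#xX;' form back into a code point
 after XML parsing. The following Python sketch shows one way to do that. The
 function name <code>decode_escapes</code> is hypothetical, and the sketch
 assumes that any text of the form '#xX;' in an attribute value is such an
 escape rather than literal text.</p>

```python
import re

# Matches the '#xX;' form described above, where X is a hex number.
_ESCAPE = re.compile(r"#x([0-9A-Fa-f]+);")


def decode_escapes(value):
    """Replace each '#xX;' escape in value with the code point it names.

    Assumption: every '#xX;' sequence is an escape, never literal text.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)


print(repr(decode_escapes("#x9;")))
```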
 <h3><a name="AttributeAbbreviations">Attribute Abbreviations</a></h3>
 <p>To reduce the size of the document, the following attribute abbreviations are
 used. A missing attribute takes its default value; the defaults are listed in
 parentheses below. If no default is listed, a missing attribute should be read
 as N/A (not applicable). A default beginning with '=' means that the default is
 the value of another field, applied recursively. Thus if the titlecase attribute
 is missing, its value is the same as the uppercase. If that in turn is missing,
 the value is the same as the code point itself.</p>
 <p>For a description of the source files, see <a
 href="http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html">UnicodeCharacterDatabase.html</a>.
 That file also has links to the descriptions of the fields within the files.
 Since the PropList values are so long, they will probably also be abbreviated in
 the future.</p>
 <table border="1" width="100%">
   <tr>
     <td width="50%" valign="top"><span class="cb" id style="DISPLAY: block">
       <h4>UnicodeData:</h4>
       <p>&nbsp; c: code point<br>
       &nbsp; n: name<br>
       &nbsp; gc: general category (Lo)<br>
       &nbsp; cc: combining class (0)<br>
       &nbsp; bc: bidi category (L)<br>
       &nbsp; dm: decomposition mapping<br>
       &nbsp; dt: decomposition type (canonical)<br>
       &nbsp; nt: numeric type<br>
       &nbsp; nv: numeric value<br>
       &nbsp; bm: bidi mirrored (N)<br>
       &nbsp; uc: uppercase (=c)<br>
       &nbsp; lc: lowercase (=c)<br>
       &nbsp; tc: titlecase (=uc)</p>
       <h4>SpecialCasing:</h4>
       <p>&nbsp; sl: special lower (=lc)<br>
       &nbsp; su: special upper (=uc)<br>
       &nbsp; st: special title (=su)<br>
       &nbsp; sc: special case condition</p>
       <h4>CaseFolding:</h4>
       <p>&nbsp; fc: foldcase (=sl)</span></td>
     <td width="50%" valign="top"><span class="cb" id style="DISPLAY: block">
       <h4>CompositionExclusions:</h4>
       <p>&nbsp; ce: composition exclusion (N)</p>
       <h4>EastAsianWidth:</h4>
       <p>&nbsp; ea: east asian width (N)</p>
       <h4>Jamo:</h4>
       <p>&nbsp; jn: jamo name</p>
       <h4>LineBreak:</h4>
       <p>&nbsp; lb: line break class (AL)</p>
       <h4>ArabicShaping:</h4>
       <p>&nbsp; jt: joining type<br>
       &nbsp; jg: joining group</p>
       <h4>BidiMirroring:</h4>
       <p>&nbsp; bg: bidi mirroring glyph (=c)</p>
       <h4>PropList:</h4>
       <p>&nbsp; xs: space-delimited list of properties from the file</p>
       <p><b><i>WARNING: these values are likely to change!</i></b></span></td>
   </tr>
 </table>
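 <p>The recursive defaults can be sketched in Python as follows. The
 <code>DEFAULTS</code> table is copied from the key above; the function name
 <code>resolve</code> and the use of <code>None</code> for N/A are assumptions
 made for the example. The sketch does not model the rule that <b>dt</b> only
 has meaning when <b>dm</b> is present.</p>

```python
# Defaults copied from the abbreviation key. A leading "=" means
# "use the resolved value of that other attribute".
DEFAULTS = {
    "gc": "Lo", "cc": "0", "bc": "L", "dt": "canonical", "bm": "N",
    "uc": "=c", "lc": "=c", "tc": "=uc",
    "sl": "=lc", "su": "=uc", "st": "=su", "fc": "=sl",
    "ce": "N", "ea": "N", "lb": "AL", "bg": "=c",
}


def resolve(attrs, name):
    """Return the value of attribute name for one element.

    attrs is the dict of attributes actually present on the element.
    Returns None (read as N/A) when the attribute is missing and the
    key lists no default for it.
    """
    if name in attrs:
        return attrs[name]
    default = DEFAULTS.get(name)
    if default is None:
        return None
    if default.startswith("="):
        # Follow the chain, e.g. tc -> uc -> c.
        return resolve(attrs, default[1:])
    return default


attrs = {"c": "a", "uc": "A"}
print(resolve(attrs, "tc"))  # titlecase falls back to uppercase
print(resolve(attrs, "fc"))  # foldcase falls back through sl and lc to c
```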
 <br>
 <h3><a name="DataModifications">Data Modifications</a></h3>
 </span>
 <p>The XML format is generated from the original semicolon-delimited UCD files.
 In general, all fields and values are direct copies. However, there are some
 changes, detailed below.</p>
 <h4>1. Some redundant or annotational fields are omitted</h4>
 <table border="1" width="100%">
   <tr>
     <td width="50%" valign="top"><b>UnicodeData<br>
       </b>1.0 Name<br>
       10646 comment<br>
       <br>
       <b>CaseFolding<br>
       </b>Type (since it is computable from whether the fold equals the normal
       lowercase)
       <p><b>ArabicShaping<br>
       </b>Name<br>
       <br>
       <b>EastAsianWidth<br>
       </b>Name<br>
       <br>
       <b>LineBreak<br>
       </b>Name</p>
     </td>
     <td width="50%" valign="top"><b>PropList</b><font face="Times New Roman"
       color="#000000">
       <p>The fields are based on the proposed PropList.alpha, which changes the
       fields considerably.</p>
       </font>
       <p><span class="cb" id style="display: block"><b><i>WARNING: other values
       are also likely to change!</i></b></span></p>
     </td>
   </tr>
 </table>
 <h4>2. Some fields are broken into several fields; others may be combined into a
 single field</h4>
 <ul>
   <li><b>dt: </b>decomposition type
     <ul>
       <li>the 'tag' field extracted from the decomposition mapping. If there is
         no tag, the value is &quot;canonical&quot;. Only has meaning if there is
         a decomposition (<b>dm</b>).</li>
     </ul>
   </li>
   <li><b>nt: </b>numeric type
     <ul>
       <li>an enumeration [decimal, digit, numeric] for the type of number. It
         replaces the duplicated numeric field values in UnicodeData.</li>
     </ul>
   </li>
   <li><b>rg: </b>range
     <ul>
       <li>used for ranges of values that share characteristics, instead of
         having to do a substring check.<br>
         &quot;START&quot; corresponds to &quot;&lt;..., First&gt;&quot;<br>
         &quot;END&quot; corresponds to &quot;&lt;..., Last&gt;&quot;</li>
     </ul>
   </li>
   <li><b>nc: </b>name computed
     <ul>
       <li>if &quot;COMPUTED&quot;, indicates that the name must be computed,
         for example for Hangul Syllables and Ideographs.</li>
     </ul>
   </li>
   <li><b>na: </b>name annotation
     <ul>
       <li>used for code points that do not really have associated names, such as
         control characters and private use characters. The data in that case is
         either extracted from the &quot;&lt;...&gt;&quot; style name in the old
         format, or taken from the &quot;1.0 Unicode name&quot;.</li>
     </ul>
   </li>
 </ul>
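 <p>As an illustration of how one source field is split, the following Python
 sketch separates a UnicodeData decomposition field into <b>dt</b> and
 <b>dm</b>. The function name <code>split_decomposition</code> is hypothetical.
 The sketch assumes that the tag is stored without its angle brackets, and it
 leaves the mapping as the hex string found in the source file; the generated
 XML may represent either value differently.</p>

```python
def split_decomposition(field):
    """Split a UnicodeData decomposition field into (dt, dm).

    Returns (None, None) when the character has no decomposition,
    since dt only has meaning when dm is present.
    """
    field = field.strip()
    if not field:
        return None, None
    if field.startswith("<"):
        # Tagged form, e.g. "<compat> 0020 0308".
        tag, _, mapping = field.partition(">")
        return tag[1:], mapping.strip()
    # No tag: the decomposition is canonical.
    return "canonical", field


print(split_decomposition("<compat> 0020 0308"))
print(split_decomposition("0041 030A"))
```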

 </body>

 </html>