| <html> |
| |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> |
| <meta name="GENERATOR" content="Microsoft FrontPage 4.0"> |
| <meta name="ProgId" content="FrontPage.Editor.Document"> |
| <title>Unicode Character Database</title> |
| <style> |
| <!-- |
| table { padding: 4 } |
| td { padding: 4 } |
| --> |
| </style> |
| </head> |
| |
| <body> |
| |
| <span class="cb" id style="DISPLAY: block"> |
| <h1 align="center">Unicode Character Database (UCD) in XML Format</h1> |
| <h1 align="center"><b><font color="#FF0000">WARNING: FORMAT IS DRAFT!</font></b></h1> |
| <p align="center">MD 2000.10.16</p> |
| <table border="1" width="40%" align="right" cellspacing="4" cellpadding="0"> |
| <tr> |
| <td width="100%" bgcolor="#C0C0C0"><span class="cb" id |
| style="DISPLAY: block"> |
| <h4 align="center">Using Internet Explorer</h4> |
| <p>The UCD-Main.xml file can be read in Internet Explorer (5.0 and above). |
| However:</p> |
| <ul> |
| <li>It may take a few minutes to load completely.</li> |
| <li>The XML parser in IE does not appear to be conformant: it seems to |
| break on</span> the following valid code points (and others): |
| <ul> |
| <li>&lt;IEbugs<br> |
| c1='&amp;#xFFF9;'<br> |
| c2='&amp;#xFFFA;'<br> |
| c3='&amp;#xFFFB;'<br> |
| c4='&amp;#xFFFC;'<br> |
| c5='&amp;#xFFFD;'<br> |
| c6='&amp;#xF0000;'<br> |
| c7='&amp;#xFFFFD;'<br> |
| c8='&amp;#x100000;'<br> |
| c9='&amp;#x10FFFD;'/&gt;</li> |
| </ul> |
| </li> |
| </ul> |
| </td> |
| </tr> |
| </table> |
| <p><a href="UCD-Main.xml">UCD-Main.xml</a> provides an XML format for the main |
| files in the Unicode Character Database. These include:</p> |
| <ul> |
| <li><code>UnicodeData.txt</code></li> |
| <li><code>ArabicShaping.txt</code></li> |
| <li><code>Jamo.txt</code></li> |
| <li><code>SpecialCasing.txt</code></li> |
| <li><code>CompositionExclusions.txt</code></li> |
| <li><code>EastAsianWidth.txt</code></li> |
| <li><code>LineBreak.txt</code></li> |
| <li><code>BidiMirroring.txt</code></li> |
| <li><code>CaseFolding.txt</code></li> |
| <li><code>Blocks.txt</code></li> |
| <li><code>PropList.alpha.txt</code></li> |
| </ul> |
| <p>Other files in the UCD have a very different structure or purpose, and are best |
| expressed with separate files. Some annotational data, such as that in |
| NamesList.txt or the 10646 comment in UnicodeData, is also best served with |
| separate files. The current UCD files not yet in XML format are:</p> |
| <ul> |
| <li><code>Unihan.txt</code></li> |
| <li><code>NamesList.txt</code></li> |
| <li><code>Index.txt</code></li> |
| <li><code>NormalizationTest.txt</code></li> |
| </ul> |
| <h3>Format</h3> |
| <p>The Unicode blocks are provided as a list of &lt;block .../&gt; elements, |
| with attributes providing the start, end, and name.</p> |
| <p>Each assigned code point is an &lt;e .../&gt; element, with attributes |
| supplying specific properties. The meaning of the attributes is specified below. |
| There is one exception: large ranges of code points for characters such as |
| Hangul Syllables are abbreviated by indicating the start and end of the range.</p> |
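| <p>As an illustration only, the sketch below reads a small hand-written sample and |
| pairs the entries that open and close an abbreviated range. The root element name |
| <code>ucd</code>, the sample entries, and the helper names are assumptions made for |
| this example; the <b>rg</b> values "START" and "END" are the ones described under |
| Data Modifications.</p> |

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the root element name and the attribute values are
# assumptions for illustration, not copied from UCD-Main.xml.
SAMPLE = """<ucd>
  <e c="&#x41;" n="LATIN CAPITAL LETTER A" gc="Lu"/>
  <e c="&#xAC00;" rg="START" nc="COMPUTED"/>
  <e c="&#xD7A3;" rg="END" nc="COMPUTED"/>
</ucd>"""


def code_point(value):
    """Return the integer code point stored in a c attribute."""
    # Code points that cannot be written as NCRs use the '#xX;' form.
    if value.startswith("#x") and value.endswith(";"):
        return int(value[2:-1], 16)
    # Otherwise the parser has already turned the NCR into one character.
    return ord(value)


def iter_ranges(root):
    """Yield (first, last) for every entry, pairing START with END."""
    start = None
    for e in root.iter("e"):
        cp = code_point(e.get("c"))
        rg = e.get("rg")
        if rg == "START":
            start = cp
        elif rg == "END":
            if start is None:
                raise ValueError("END without START at U+%04X" % cp)
            yield start, cp
            start = None
        else:
            # A single code point is treated as a range of length one.
            yield cp, cp


ranges = list(iter_ranges(ET.fromstring(SAMPLE)))
```

| <p>Treating a single code point as a range of length one keeps the consumer code |
| uniform; a real reader would probably also carry the other attributes along.</p> |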
| <p>Because of the volume of data, the attribute names are abbreviated. A <a |
| href="#AttributeAbbreviations">key</a> explains the abbreviations, and relates |
| them to the fields and values of the original UCD semicolon-delimited files. |
| With few exceptions, the values in the XML are directly copied from data in the |
| original UCD semicolon-delimited files. Those exceptions are described <a |
| href="http://www.unicode.org/Public/3.0-Update1/UnicodeCharacterDatabase-3.0.1.html#DataModifications">below</a>.</p> |
| <p>Numeric character references (NCRs) are used to encode the Unicode code |
| points. Some Unicode code points cannot be transmitted in XML, even as NCRs (see |
| <a href="http://www.w3.org/TR/REC-xml#charsets">http://www.w3.org/TR/REC-xml#charsets</a>), |
| or would not be visibly distinct (TAB, CR, LF) in the data. Such code points are |
| represented by '#xX;', where X is the code point as a hexadecimal number.</p> |
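| <p>A minimal sketch of this rule, assuming that every other code point is written |
| as a hexadecimal NCR and that X is upper-case hex without padding (neither detail |
| is stated above). The function names are hypothetical; the legal ranges follow the |
| Char production of XML 1.0.</p> |

```python
def is_xml_char(cp):
    """True if cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def encode_code_point(cp):
    """Return the text used to represent one code point in the XML data."""
    # TAB, LF and CR are legal XML characters, but they would not be
    # visibly distinct, so they take the escaped form as well.
    if not is_xml_char(cp) or cp in (0x9, 0xA, 0xD):
        return "#x%X;" % cp
    # Assumption: all remaining code points are written as hex NCRs.
    return "&#x%X;" % cp
```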
| <h3><a name="AttributeAbbreviations">Attribute Abbreviations</a></h3> |
| <p>To reduce the size of the document, the following attribute abbreviations are |
| used. A missing attribute takes its default value; the defaults are listed in |
| parentheses below. If there is no specific default, then a missing attribute |
| should be read as N/A (not applicable). A default with '=' means that the default |
| is the value of another field (recursively!). Thus if the titlecase attribute is |
| missing, then the value is the same as the uppercase. If that in turn is missing, |
| then the value is the same as the code point itself.</p> |
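| <p>The recursive lookup can be sketched as follows. The default values are the |
| ones given in parentheses in the key; the names <code>DEFAULTS</code> and |
| <code>resolve</code> are hypothetical, and a real reader may need to treat some |
| attributes specially (for example, <b>dt</b> only has meaning when <b>dm</b> is |
| present).</p> |

```python
# Defaults taken from the attribute key; "=x" means "use the value of x".
DEFAULTS = {
    "gc": "Lo", "cc": "0", "bc": "L", "dt": "canonical", "bm": "N",
    "uc": "=c", "lc": "=c", "tc": "=uc",
    "sl": "=lc", "su": "=uc", "st": "=su", "fc": "=sl",
    "ce": "N", "ea": "N", "lb": "AL", "bg": "=c",
}


def resolve(attrs, name):
    """Return the effective value of an attribute, or None for N/A."""
    if name in attrs:
        return attrs[name]
    default = DEFAULTS.get(name)
    if default is None:
        return None  # no specific default: not applicable
    if default.startswith("="):
        # The default is the value of another attribute, resolved recursively.
        return resolve(attrs, default[1:])
    return default


entry = {"c": "a", "uc": "A"}
titlecase = resolve(entry, "tc")   # falls back to uc
foldcase = resolve(entry, "fc")    # falls back to sl, then lc, then c
```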
| <p>For a description of the source files, see <a |
| href="http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html">UnicodeCharacterDatabase.html</a>. |
| That file also has links to the descriptions of the fields within the files. |
| Since the PropList values are so long, they will probably also be abbreviated in |
| the future.</p> |
| <table border="1" width="100%"> |
| <tr> |
| <td width="50%" valign="top"><span class="cb" id style="DISPLAY: block"> |
| <h4>UnicodeData:</h4> |
| <p> c: code point<br> |
| n: name<br> |
| gc: general category (Lo)<br> |
| cc: combining class (0)<br> |
| bc: bidi category (L)<br> |
| dm: decomposition mapping<br> |
| dt: decomposition type (canonical)<br> |
| nt: numeric type<br> |
| nv: numeric value<br> |
| bm: bidi mirrored (N)<br> |
| uc: uppercase (=c)<br> |
| lc: lowercase (=c)<br> |
| tc: titlecase (=uc)</p> |
| <h4>SpecialCasing:</h4> |
| <p> sl: special lower (=lc)<br> |
| su: special upper (=uc)<br> |
| st: special title (=su)<br> |
| sc: special case condition</p> |
| <h4>CaseFolding:</h4> |
| <p> fc: foldcase (=sl)</span></td> |
| <td width="50%" valign="top"><span class="cb" id style="DISPLAY: block"> |
| <h4>CompositionExclusions:</h4> |
| <p> ce: composition exclusion (N)</p> |
| <h4>EastAsianWidth:</h4> |
| <p> ea: east asian width (N)</p> |
| <h4>Jamo:</h4> |
| <p> jn: jamo name</p> |
| <h4>LineBreak:</h4> |
| <p> lb: line break class (AL)</p> |
| <h4>ArabicShaping:</h4> |
| <p> jt: joining type<br> |
| jg: joining group</p> |
| <h4>BidiMirroring:</h4> |
| <p> bg: bidi mirroring glyph (=c)</p> |
| <h4>PropList:</h4> |
| <p> xs: space-delimited list of properties from the file</p> |
| <p><b><i>WARNING: these values are likely to change!</i></b></span></td> |
| </tr> |
| </table> |
| <br> |
| <h3><a name="DataModifications">Data Modifications</a></h3> |
| </span> |
| <p>The XML format is generated from the original semicolon-delimited UCD files. |
| In general, all fields and values are direct copies. However, there are some |
| changes, detailed below.</p> |
| <h4>1. Some redundant or annotational fields are omitted</h4> |
| <table border="1" width="100%"> |
| <tr> |
| <td width="50%" valign="top"><b>UnicodeData<br> |
| </b>1.0 Name<br> |
| 10646 comment<br> |
| <br> |
| <b>CaseFolding<br> |
| </b>Type (omitted because it can be computed from whether the fold equals the |
| normal lowercase) |
| <p><b>ArabicShaping<br> |
| </b>Name<br> |
| <br> |
| <b>EastAsianWidth<br> |
| </b>Name<br> |
| <br> |
| <b>LineBreak<br> |
| </b>Name</p> |
| </td> |
| <td width="50%" valign="top"><b>PropList</b><font face="Times New Roman" |
| color="#000000"> |
| <p>The fields are based on the proposed PropList.alpha, which changes the |
| fields considerably.</p> |
| </font> |
| <p><span class="cb" id style="display: block"><b><i>WARNING: other values |
| are also likely to change!</i></b></span></p> |
| </td> |
| </tr> |
| </table> |
| <h4>2. Some fields are broken into several fields; others may be combined into a |
| single field</h4> |
| <ul> |
| <li><b>dt: </b>decomposition tag |
| <ul> |
| <li>the 'tag' field extracted from the decomposition mapping. If there is |
| no tag, the value is "canonical". The attribute is meaningful only if |
| there is a decomposition mapping (<b>dm</b>).</li> |
| </ul> |
| </li> |
| <li><b>nt: </b>numeric type |
| <ul> |
| <li>an enumeration [decimal, digit, numeric] for the type of number. It |
| replaces the duplicate field values that UnicodeData.txt uses for numbers.</li> |
| </ul> |
| </li> |
| <li><b>rg: </b>range |
| <ul> |
| <li>used for ranges of code points that share characteristics, so that a |
| reader does not have to do a substring check on the name.<br> |
| "START" corresponds to "&lt;..., First&gt;"<br> |
| "END" corresponds to "&lt;..., Last&gt;"</li> |
| </ul> |
| </li> |
| <li><b>nc: </b>name computed |
| <ul> |
| <li>if "COMPUTED", indicates that the name must be computed, |
| for example for Hangul Syllables and Ideographs</li> |
| </ul> |
| </li> |
| <li><b>na: </b>name annotation |
| <ul> |
| <li>used for code points that do not really have associated names, like |
| control characters and private use characters. The data in that case is |
| either extracted from the "&lt;...&gt;" style name in the old |
| format, or taken from the "1.0 Unicode name".</li> |
| </ul> |
| </li> |
| </ul> |
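| <p>The splitting described for <b>dt</b> and <b>nt</b> can be sketched against one |
| line of UnicodeData.txt. This is not the generator actually used for |
| UCD-Main.xml: the function names are hypothetical, and it is an assumption that |
| the tag is stored without its angle brackets and that the mapping is kept as a |
| list of hex code points.</p> |

```python
def split_decomposition(field):
    """Split the decomposition field of UnicodeData.txt into (dt, dm)."""
    if not field:
        return None, None              # no decomposition at all
    if field.startswith("<"):
        # "<compat> 0020 0308" -> tag "compat", mapping "0020 0308"
        tag, _, mapping = field.partition("> ")
        return tag[1:], mapping
    return "canonical", field          # no tag means canonical


def numeric_type(decimal, digit, numeric):
    """Collapse the three numeric fields into (nt, nv)."""
    if decimal:
        return "decimal", decimal
    if digit:
        return "digit", digit
    if numeric:
        return "numeric", numeric
    return None, None


def derived_attributes(line):
    """Return the dt, dm, nt and nv attributes for one UnicodeData line."""
    f = line.split(";")
    dt, dm = split_decomposition(f[5])
    nt, nv = numeric_type(f[6], f[7], f[8])
    attrs = {"dt": dt, "dm": dm, "nt": nt, "nv": nv}
    # Attributes that do not apply are simply left out.
    return {k: v for k, v in attrs.items() if v is not None}


half = derived_attributes(
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<compat> 0031 2044 0032;;;1/2;N;"
    "FRACTION ONE HALF;;;;"
)
```

| <p>Here <code>half</code> carries a "compat" tag, the three-code-point mapping, and |
| a numeric type of "numeric" with the value 1/2.</p> |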
| |
| </body> |
| |
| </html> |