| <html> |
| |
| <head> |
| <meta http-equiv="Content-Language" content="en-us"> |
| <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> |
| <title>UnicodeTools</title> |
| </head> |
| |
| <body> |
| |
| <h1>UnicodeTools</h1> |
| <p>This file provides instructions for building and running the UnicodeTools, which<br> |
| can be used to:</p> |
| <ul> |
| <li>build the Derived Unicode files in the UCD (Unicode Character Database),</li> |
| <li>build the transformed UCA (Unicode Collation Algorithm) files needed by ICU,</li> |
| <li>run consistency checks on beta releases of the UCD and the UCA,</li> |
| <li>build 4 chart folders on the Unicode site.</li> |
| </ul> |
| <p><font color="#FF0000"><b>WARNING!!</b></font></p> |
| <ul> |
| <li>This is NOT production-level code, and should never be used in programs.</li> |
| <li>The API is subject to change without notice, and will not be maintained.</li> |
| <li>The source is uncommented, and has many warts; since it is not production code, it has not |
| been worth the time to clean it up.</li> |
| <li>It will probably not work on Unix or Mac without changing the file separator.</li> |
| <li>Currently it uses hard-coded directory names.</li> |
| <li>The contents of multiple versions of the UCD must be copied to a local directory, as described |
| below.</li> |
| </ul> |
| <h2>Instructions:</h2> |
| <h3>0. You will need to get ICU4J on your system, using CVS.</h3> |
| <p>The rest of this will assume that you have set up CVS so that you load the ICU4J project into |
| C:\ICU4J<br> |
| <br> |
| You need both the main icu4j and a subproject called unicodetools. See: |
| <a href="http://ibm.com/software/globalization/icu/repository.jsp"> |
| http://ibm.com/software/globalization/icu/repository.jsp</a>. Inside unicodetools, look at com/ibm/text. The |
| main directories of interest are UCD, UCA and utility.</p> |
| <h4>0a. If you are using Eclipse for your IDE, look at the instructions on |
| <a href="http://icu.sourceforge.net/docs/eclipse_howto/eclipse_howto.html"> |
| http://oss.software.ibm.com/icu/docs/eclipse_howto/eclipse_howto.html</a> </h4> |
| <p>Set up Eclipse to build two projects: ICU4J and UnicodeTools:<br> |
| <br> |
| <b>Project Name: </b>ICU4J<br> |
| <b>Directory: </b>C:\ICU4J\icu4j<br> |
| <b>Default output folder = </b>ICU4J/classes<br> |
| <br> |
| <b>Project Name: </b>unicodetools<br> |
| <b>Create project from existing source: </b>C:\ICU4J\unicodetools<br> |
| <b>Default Output Folder: </b>unicodetools/classes<br> |
| <br> |
| After Eclipse is set up with these, exclude certain files from unicodetools:<br> |
| <br> |
| Right-Click UnicodeTools > Properties > Java Build Path > Exclusions<br> |
| com/ibm/rbm/<br> |
| com/ibm/text/utility/UnicodeMapInt.java<br> |
| com/ibm/text/utility/TestUtility.java<br> |
| com/ibm/text/UCD/GenerateThaiBreaks-old.java<br> |
| com/ibm/text/UCD/ProcessUnihan.java<br> |
| com/ibm/text/UCA/WriteHTMLCollation.java<br> |
| <br> |
| UnicodeTools must also include the ICU4J project, with<br> |
| <br> |
| Right-Click UnicodeTools > Properties > Java Build Path > Projects</p> |
| <h3>1. In UCD, you must edit UCD_Types.java at the top, to set the directories for the build:</h3> |
| <p>public static final String DATA_DIR = "C:\\DATA\\";<br> |
| public static final String UCD_DIR = BASE_DIR + "UCD\\";<br> |
| public static final String BIN_DIR = DATA_DIR + "BIN\\";<br> |
| public static final String GEN_DIR = DATA_DIR + "GEN\\";<br> |
| <br> |
| Make sure that each of these directories exists. Also make sure that the following<br> |
| exist:<br> |
| <br> |
| &lt;GEN_DIR&gt;/DerivedData<br> |
| &lt;GEN_DIR&gt;/DerivedData/ExtractedProperties<br> |
| &lt;UCD_DIR&gt;/EXTRAS-Update</p> |
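| <p>The directories above can also be created programmatically. The following is only a minimal sketch and is not part of UnicodeTools: the class name DataDirSetup is hypothetical, and it assumes that UCD_DIR sits directly under DATA_DIR, as in the default layout C:\DATA\UCD.</p> |

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of UnicodeTools.
class DataDirSetup {

    /** Returns the leaf directories that step 1 expects under dataDir. */
    static List<Path> requiredDirs(Path dataDir) {
        // Assumption: UCD, BIN and GEN all live directly under DATA_DIR.
        return Arrays.asList(
            dataDir.resolve("BIN"),
            dataDir.resolve("GEN").resolve("DerivedData").resolve("ExtractedProperties"),
            dataDir.resolve("UCD").resolve("EXTRAS-Update"));
    }

    public static void main(String[] args) throws IOException {
        // Uses the default DATA_DIR from step 1 unless a path is passed in.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "C:\\DATA");
        for (Path dir : requiredDirs(dataDir)) {
            // Creates missing parents too; does nothing if the directory exists.
            Files.createDirectories(dir);
            System.out.println("ok: " + dir);
        }
    }
}
```

| <p>Creating the leaf directories is enough, since the parents (GEN, GEN/DerivedData and UCD) are created along the way.</p> |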
| <h3>2. Download all of the UnicodeData files for each version into UCD_DIR.</h3> |
| <p>The folder names must be of the form: "3.2.0-Update", so rename the folders on the<br> |
| Unicode site to this format. I<span style="background-color: #FFFF00">f the |
| folder contains ucd, then make the contents of that directory be the contents of |
| the x.x.x-Update directory. That is, each directory will directly contain files |
| like PropList....txt</span></p> |
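| <p>As an illustration of the naming rule, the following sketch checks whether a folder name has the "3.2.0-Update" form and pads a shorter numeric version to three fields. It is not part of UnicodeTools; the class UpdateFolderName and the padding with ".0" are assumptions made for this example.</p> |

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of UnicodeTools.
class UpdateFolderName {
    private static final String SUFFIX = "-Update";

    // Three numeric fields followed by "-Update", e.g. "3.2.0-Update".
    private static final Pattern FORM = Pattern.compile("\\d+\\.\\d+\\.\\d+-Update");

    static boolean isValid(String folderName) {
        return FORM.matcher(folderName).matches();
    }

    /**
     * Pads a numeric version folder such as "4.0-Update" to "4.0.0-Update".
     * Assumption: missing fields are filled with ".0". Only meant for
     * numeric version folders, not for names like EXTRAS-Update.
     */
    static String toLocalName(String siteFolder) {
        String version = siteFolder.endsWith(SUFFIX)
            ? siteFolder.substring(0, siteFolder.length() - SUFFIX.length())
            : siteFolder;
        StringBuilder name = new StringBuilder(version);
        for (int fields = version.split("\\.").length; fields < 3; fields++) {
            name.append(".0");
        }
        return name.append(SUFFIX).toString();
    }
}
```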
| <h4>2a. Ensure Complete Release</h4> |
| <p>If you are downloading any "incomplete" release (one that does not contain a complete set of data |
| files for that release), you also need to download the previous complete release. Most of the N.M-Update |
| directories are complete, *except*:</p> |
| <p>4.0-Update, which does not contain a copy of Unihan.txt and some other files<br> |
| 3.1-Update, which does not contain a copy of BidiMirroring.txt</p> |
| <p>Also, make the following changes to UnicodeData for 1.1.5:</p> |
| <p><b>Delete</b></p> |
| <pre>3400;HANGUL SYLLABLE KIYEOK A;Lo;0;L;1100 1161;;;;N;;;;; |
| ... |
| 4DFF;HANGUL SYLLABLE MIEUM WEO RIEUL-THIEUTH;Lo;0;L;1106 116F 11B4;;;;N;;;;; |
| 4E00;&lt;CJK IDEOGRAPH REPRESENTATIVE&gt;;Lo;0;L;;;;;N;;;;;</pre> |
| <p><b>Add:</b></p> |
| <pre>4E00;&lt;CJK Ideograph, First&gt;;Lo;0;L;;;;;N;;;;; |
| 9FA5;&lt;CJK Ideograph, Last&gt;;Lo;0;L;;;;;N;;;;; |
| E000;&lt;Private Use, First&gt;;Co;0;L;;;;;N;;;;; |
| F8FF;&lt;Private Use, Last&gt;;Co;0;L;;;;;N;;;;;</pre> |
| <p><b>And from a later version of Unicode, add:</b></p> |
| <pre>F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;; |
| ... |
| FA2D;CJK COMPATIBILITY IDEOGRAPH-FA2D;Lo;0;L;9DB4;;;;N;;;;;</pre> |
| <h4>2b. UCA data</h4> |
| <p>If you are building any of the UCA tools, you need to get a copy of the UCA data file<br> |
| from http://www.unicode.org/reports/tr10/#AllKeys. The default location for this is:<br> |
| <br> |
| BASE_DIR + "Collation\allkeys" + VERSION + ".txt".<br> |
| <br> |
| If you have it in a different location, change the value for KEYS in UCA.java, and <br> |
| the value for BASE_DIR.</p> |
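| <p>To make the default location concrete, here is a small sketch of how that path is put together. It is not the code in UCA.java: the class AllKeysPath is hypothetical, and it assumes that VERSION carries its own leading hyphen (for example "-3.1.1"), which is what the file name allkeys-3.1.1.txt in the default layout suggests.</p> |

```java
// Hypothetical helper, not part of UnicodeTools.
class AllKeysPath {

    /** Builds BASE_DIR + "Collation\allkeys" + VERSION + ".txt". */
    static String keysPath(String baseDir, String version) {
        // Assumption: baseDir ends with a separator and version starts with "-".
        return baseDir + "Collation\\allkeys" + version + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(keysPath("C:\\DATA\\", "-3.1.1"));
    }
}
```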
| <h4>2c. Here is an example of the default directory structure with files. All of |
| the yellow ones should exist.</h4> |
| <pre>C:/DATA/ |
| |
| BIN/ |
| |
| <span style="background-color: #FFFF00"> Collation/ |
| allkeys-3.1.1.txt |
| </span> |
| GEN/ |
| DerivedData/ |
| <span style="background-color: #FFFF00"> </span><span style="background-color: #FFFF00">UCD/ |
| 3.0.0-Update/ |
| Unihan-3.2.0.txt |
| ... |
| 3.0.1-Update/ |
| ... |
| 3.1.0-Update/ |
| ... |
| 3.1.1-Update/ |
| ... |
| 3.2.0-Update/ |
| ... |
| 4.0.0-Update/ |
| ArabicShaping-4.0.0d14b.txt |
| BidiMirroring-4.0.0d1b.txt |
| ... |
| EXTRAS-Update/</span></pre> |
| <h3>3. Versions</h3> |
| <p>All of the following have "version X" in the options you give to Java (either on the |
| command line, or in the Eclipse 'run' options). If you want a specific version like 3.1.1, then you |
| would write "version 3.1.1". If you want the latest version (4.1.0), you can omit the "version X".</p> |
| <h3>4. Building Files</h3> |
| <ol> |
| <li><b>Setup</b><ol> |
| <li>In Eclipse, open the Package Explorer (Use Window>Show View if you |
| don't see it)</li> |
| <li>Open UnicodeTools<ul> |
| <li>com.ibm.text.UCD<ul> |
| <li>MakeUnicodeFiles.<span style="background-color: #FFFF00">txt</span><p>This file drives the production of |
| the derived Unicode files. The first three lines contain |
| parameters that you may want to modify from time to time:</p> |
| <pre>Generate: <b>.*script.*</b> <i>// this is a regular expression. Use .* for all files</i> |
| DeltaVersion: <b>10</b> <i> // This gets appended to the file name. Pick one more than the highest value in Public</i> |
| CopyrightYear: <b>2006</b> <i> // Pick the current year</i></pre> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| <li>Open in Package Explorer |
| <ul> |
| <li>com.ibm.text.UCD<ul> |
| <li>Main</li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| <li>Run>Run As...<ol> |
| <li>Choose Java Application<ul> |
| <li>It will fail; don't worry, you still need to set some parameters.</li> |
| </ul> |
| </li> |
| </ol> |
| </li> |
| <li>Run>Run...<ul> |
| <li>Select the Arguments tab, and fill in the following<ul> |
| <li>Program arguments:<pre>build 5.0 MakeUnicodeFiles</pre> |
| </li> |
| <li>VM arguments: |
| <pre>-Xms512m -Xmx512m</pre> |
| </li> |
| </ul> |
| </li> |
| <li>Close and Save</li> |
| </ul> |
| </li> |
| </ol> |
| </li> |
| <li><b>Run</b><ol> |
| <li>You'll see it build the 5.0 files, with something like the following |
| results:<pre>Writing UCD_Data5.0.0 |
| Data Size: 109,802 |
| Wrote Data 109802</pre> |
| </li> |
| <li>For each version, the tools build a set of binary data in BIN that |
| contains the information for that release. This is done automatically, or |
| you can do it manually with the Program Arguments<pre>version X build</pre> |
| <p>This builds a compressed format of all the UCD data (except blocks |
| and Unihan) into the BIN directory. Don't worry about the voluminous |
| console messages, unless one says "FAIL".</p> |
| <p><font color="#FF0000"><i>You have to manually do this if you change |
| any of the data files in that version!</i></font></p> |
| <p>Note: if for any reason you modify the binary format of the BIN files, you also have to bump the |
| value in that file:</p> |
| <pre>static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes</pre> |
| </li> |
| </ol> |
| </li> |
| <li>Results in <a href="file:///C:/DATA/GEN/DerivedData"> |
| C:\DATA\GEN\DerivedData</a><ol> |
| <li>The files will be in this directory.</li> |
| <li>There are also DIFF folders that contain BAT files that you can run |
| on Windows with CompareIt. (You can modify the code to build BATs with |
| another diff program if you want.)<ol> |
| <li>For any file with a significant difference, it will build two |
| BAT files, such as the first two below.<pre>Diff_PropList-5.0.0d10.txt.bat |
| OLDER-Diff_PropList-5.0.0d10.txt.bat |
| |
| UNCHANGED-Diff_PropertyValueAliases-5.0.0d10.txt.bat</pre> |
| </li> |
| </ol> |
| </li> |
| <li>Any files without significant changes will have "UNCHANGED" as a |
| prefix: ignore them. The OLDER prefix marks the comparison to the |
| last version of Unicode.</li> |
| <li>On Windows you can run these BATs to compare files.</li> |
| </ol> |
| </li> |
| </ol> |
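| <p>The prefix rules for the BAT files in the DIFF folders can be sketched as follows. This is not code from UnicodeTools: the class DiffBatFilter is hypothetical, and it assumes that only the file-name prefix is needed to decide which BAT files are worth running.</p> |

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UnicodeTools.
class DiffBatFilter {

    /** Keeps the BAT names that report a significant difference. */
    static List<String> changed(List<String> batNames) {
        List<String> result = new ArrayList<>();
        for (String name : batNames) {
            // Files without significant changes carry the UNCHANGED prefix.
            if (!name.startsWith("UNCHANGED")) {
                result.add(name);
            }
        }
        return result;
    }

    /** True if the BAT compares against the last version of Unicode. */
    static boolean comparesToLastVersion(String batName) {
        return batName.startsWith("OLDER");
    }
}
```

| <p>Given the three example names in step 4, only the two PropList BAT files would be kept.</p> |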
| <h3>5. Invariant Checking</h3> |
| <ol> |
| <li>Setup<ol> |
| <li>Open in Package Explorer<ul> |
| <li>com.ibm.text.UCD<ul> |
| <li>TestUnicodeInvariants.java</li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| <li>Run>Run As... Java Application<br> |
| This creates the following file of results:<pre><a href="file:///C:/DATA/GEN/UnicodeInvariantResults.txt">C:\DATA\GEN\UnicodeInvariantResults.txt</a></pre> |
| </li> |
| <li>Open that file and search for "**** START Error Info ****". Each such |
| point provides a dump of comparison information.</li> |
| </ol> |
| </li> |
| </ol> |
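| <p>Instead of searching by hand, the error markers in the results file can be counted with a few lines of Java. This is a minimal sketch and not part of UnicodeTools: the class InvariantErrorCount is hypothetical, and it assumes that the results file is at the default location and can be read as UTF-8.</p> |

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of UnicodeTools.
class InvariantErrorCount {
    static final String MARKER = "**** START Error Info ****";

    /** Counts the lines that contain the error marker. */
    static int count(List<String> lines) {
        int errors = 0;
        for (String line : lines) {
            if (line.contains(MARKER)) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) throws IOException {
        // Default location from step 5; pass another path as the first argument.
        Path results = Paths.get(args.length > 0
            ? args[0] : "C:\\DATA\\GEN\\UnicodeInvariantResults.txt");
        // Assumption: UTF-8; readAllLines throws if the bytes are not valid UTF-8.
        List<String> lines = Files.readAllLines(results, StandardCharsets.UTF_8);
        System.out.println(count(lines) + " error dump(s) in " + results);
    }
}
```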
| <h3>6. Options</h3> |
| <ol> |
| <li>If you want to see files that are opened while processing, do the |
| following:<ol> |
| <li>Run>Run</li> |
| <li>Select the Arguments tab, and add the following<ol> |
| <li>VM arguments: |
| <pre>-DSHOW_FILES</pre> |
| </li> |
| </ol> |
| </li> |
| </ol> |
| </li> |
| </ol> |
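| <p>For reference, a VM argument of the form -DSHOW_FILES defines a Java system property with that name and an empty value. How the tools read it is not shown here; the sketch below, with the hypothetical class ShowFilesFlag, only illustrates one way such a flag can be tested.</p> |

```java
// Hypothetical helper, not part of UnicodeTools.
class ShowFilesFlag {

    /** True if the JVM was started with -DSHOW_FILES (with or without a value). */
    static boolean enabled() {
        // -DSHOW_FILES without "=value" sets the property to the empty string,
        // so test for presence rather than for a particular value.
        return System.getProperty("SHOW_FILES") != null;
    }

    public static void main(String[] args) {
        System.out.println("SHOW_FILES enabled: " + enabled());
    }
}
```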
| <h3>7. UCA</h3> |
| <ol> |
| <li> |
| <h3>You will use com.ibm.text.UCA.Main as your main class, creating a run configuration along |
| the same lines as above.</h3></li> |
| <li> |
| <h4>To build all the UCA files used by ICU, use the Program arguments:</h4> |
| <pre>Main ICU</pre> |
| </li> |
| <li> |
| <h4>To build all the charts, use the UCA project, with options: </h4> |
| <pre>normalizationChart caseChart scriptChart indexChart</pre> |
| </li> |
| </ol> |
| |
| </body> |
| |
| </html> |