| <html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> |
| <title>GB 18030</title> |
| </head> |
| |
| <!-- Copyright (C) 2000, International Business Machines Corporation and others. All Rights Reserved. --> |
| |
| <body> |
| <h1>GB 18030</h1> |
| <p align="right">Markus Scherer, 2000-nov-30</p> |
| |
| <p>GB 18030 is a new Chinese codepage standard, published 2000-mar-17, that is designed for |
| <ul> |
| <li>Upwards compatibility with the GB 2312-1980 standard</li> |
| <li>Compatibility with the GBK specification, updated for Unicode 3.0</li> |
| <li>Full coverage of all Unicode code points similar to a UTF</li> |
| </ul> |
| After discussions between the Chinese standards agency and IT companies, GB 18030 was <em>republished</em>. |
| On 2000-nov-30, a modified mapping table file was released, and |
| the text of the standard is expected to be republished in December.</p> |
| |
| <p>Byte sequence structure: |
| <ul> |
| <li>Single-byte: 00-7f</li> |
| <li>Two-byte: 81-fe | 40-7e, 80-fe</li> |
| <li>Four-byte: 81-fe | 30-39 | 81-fe | 30-39</li> |
| </ul></p> |
| |
| <p>Special properties of GB 18030: |
| <ul> |
| <li>Huge: 1.6 million codepage code points — probably the largest codepage</li> |
| <li>Similar to UTF: All 1.1 million Unicode code points U+0000-U+10ffff |
| except for surrogates U+d800-U+dfff |
| map to and from GB 18030 codes.</li> |
| <li>Most of these mappings, except for parts of the BMP, can be done algorithmically. |
| This makes it an unusual mix of a Unicode encoding with a traditional codepage.</li> |
| <li>It is not possible for all codepage byte sequences to determine the length of |
| the sequence from the first byte. This is unusual.</li> |
| </ul></p> |
| |
| <h2>Generating a GB 18030 mapping table</h2> |
| |
| <p>GB 18030 is derived from existing standards and specifications, |
| and a mapping table can be generated from existing data with modifications.<br> |
| On 2000-nov-30, the Chinese standards agency released a mapping table that differs |
| from the original specification from 2000-mar-17. It changes all four-byte GB sequences |
| for Unicode BMP code points, removes any mappings for single surrogates, |
| removes all fallback mappings, and changes some mappings to further update the GBK portion to Unicode 3.0.</p> |
| |
| <p>This following description illustrates the genesis and structure of GB 18030. |
| The actual data that is included here contains the actual one- and two-byte GB 18030 mappings |
| for Unicode BMP code points as released by the Chinese standards agency. |
| To skip the historical discussion, continue at the <a href="#officialdata">discussion of released data</a>.</p> |
| |
| <p>Historical discussion based on the specification from 2000-mar-17: |
| <ol> |
| <li>GBK is a specification (not a standard) that is an extension of GB 2312-1980 |
| to cover the ideographs in Unicode 2.0. Microsoft co-authored GBK. |
| Get a GBK table, e.g. the one for Microsoft Windows 2000 codepage 936 |
| from <a href="http://oss.software.ibm.com/icu/charset/">ICU sample charsets</a>. |
| (There are <a href="http://oss.software.ibm.com/icu/charset/CharMaps-UCM/">.ucm files</a> |
| including <a href="http://oss.software.ibm.com/icu/charset/CharMaps-UCM/windows-936-2000.ucm">windows-936-2000.ucm</a>.)</li> |
| <li>From the Microsoft codepage table, remove all fallback mappings and the one for GB+ff. |
| Note that the Windows 2000 version contains the Euro sign at GB+80=U+20ac. |
| Leave it in there for GB 18030.</li> |
| <li>Get a copy of appendix E of the GB 18030 standard. |
| There are 79 characters with "temporary" and "new" Unicode mappings. |
| The temporary ones map to private-use code points because the characters were not assigned in Unicode 2.0. |
| In the data, change them from roundtrip mappings to fallbacks. |
| The new mappings are to Unicode 3.0 code points. |
| Add them as roundtrip mappings to your data.</li> |
| <li>U+0080 is not currently mapped by the standard. |
| Also, there is a small number of known errors, typos, and ambiguities in the original standard publication. |
| See <a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf">this summary</a>. |
| I have added U+0080=GB+8432eb38 to my data, corrected GB+fe5e=U+2e97, and added fallbacks from GB to Unicode |
| for the four-byte codes for U+2e97 and U+303e. |
| This is not official at this point!</li> |
| <li>You should arrive at data like <a href="gbkuni30.txt">gbkuni30.txt</a>. |
| This file has the following simplified format on each line:<br> |
| <code>unicode (':' | '>' | '<') gb ['*' ['*']]</code><br> |
| The left column contains the Unicode code point, the right column the byte sequence in GB 18030. |
| The delimiter is either a colon for roundtrip mappings or a greater-than sign |
| for fallbacks from Unicode to the codepage, or a less-than sign for fallbacks from the codepage to Unicode. |
| I have marked mappings of the appendix E characters with a star. |
| In addition, I have marked mappings that <em>should be</em> in appendix E with a double star.</li> |
| </ol> |
| </p> |
| |
| <p><a name="officialdata"></a> |
| The above description explains the state of the data from 2000-mar-17 with corrections. |
| On 2000-nov-30, a new mapping table was released that is now the base for the supplied <a href="gbkuni30.txt">gbkuni30.txt</a>. |
| Compared with the above, it |
| <ul> |
| <li>includes explicit mappings for the ASCII characters</li> |
| <li>corrects the mappings for U+2e97 and U+303e</li> |
| <li>removes any mappings for single surrogates</li> |
| <li>removes all fallback mappings, specifying only roundtrip mappings between GB 18030 and Unicode 3.0</li> |
| <li>changes the mapping for the Euro sign and removes the fullwidth Euro sign (GB+80 is not used any more)</li> |
| <li>re-enumerates all four-byte GB sequences for Unicode BMP code points from U+0080</li> |
| </ul> |
| This results in a new codepage definition that removes backwards compatibility with GBK for some 80 characters |
| because of the lack of fallback mappings. |
| The re-enumeration of all four-byte mappings for the Unicode BMP means that about 40000 mappings change from |
| the original specification, including about 25000 mappings for characters that are assigned in Unicode 3.0.</p> |
| |
| <p>This is how the data is prepared for use with ICU, as was done for the actual ICU implementation of the GB 18030 converter: |
| <ol> |
| <li>Compile <a href="gbmake4.c">gbmake4</a> and run it with the above file as stdin input. |
| You will get as output all the four-byte mappings for all |
| BMP code points that do not have a one-byte or two-byte mapping.</li> |
| <li>All Unicode code points on the supplementary planes, U+10000-U+10ffff, are mapped as well. |
| Their GB 18030 codes are four-byte sequences starting at GB+90308130. |
| You can enumerate them lexically by keeping the second and fourth bytes |
| between 0x30 and 0x39 and the third byte between 0x81 and 0xfe. For example:</li> |
| <pre> |
| U+10000=GB+90308130 |
| U+10001=GB+90308131 |
| U+10002=GB+90308132 |
| ... |
| U+1000a=GB+90308230 |
| U+1000b=GB+90308231 |
| ... |
| U+10ffff=GB+e3329a35 |
| </pre> |
| You can calculate linear values and differences between GB 18030 four-byte sequences |
| with <a href="lineargb.c">lineargb</a>. |
| <li>Done! The result is a set of 0x110000 mappings!</li> |
| <li>Of course, an economic implementation would handle the mappings for the |
| supplementary planes algorithmically. |
| Also, large parts of the BMP mappings are contiguous and can be |
| handled similarly. For an ICU MBCS converter, U+fffe and U+ffff should |
| in any case be special-cased because these values have special meaning in .cnv files.</li> |
| <li>You can have gbmake4 generate a list of contiguous four-byte ranges in the BMP. |
| Run it with the same input but specify "r" as an argument. |
| Sort the output descending. |
| Select the ranges that you deem useful, add the one including U+fffe and U+ffff. |
| For example, see <a href="ranges.txt">ranges.txt</a>.</li> |
| <li>If you concatenate gbkuni30.txt and your selected ranges including the |
| "ranges" line in between, you can run this through gbmake4 again and |
| get a mapping table without the code points in the ranges.</li> |
| <li>For an ICU converter, turn your data into a .ucm file and |
| add the header information. |
| Keep the roundtrip/fallback information: |
| roundtrip mappings (':') need a trailing "|0", fallback mappings ('>') a trailing "|1". |
| You can use <a href="gbtoucm.c">gbtoucm</a>.<br> |
| Use <a href="gbucm.bat">gbucm.bat</a> to generate such a file from the data provided here.</li> |
| <li>Also for an ICU MBCS converter, you need to specify a state table for the codepage |
| that describes its structure. For example, with the supplementary planes and the |
| <a href="ranges.txt">suggested ranges</a> handled algorithmically and therefore |
| declared as "unassigned", see this <a href="gbstates.txt">sample state table</a>. |
| (Tip: for determining which bytes can be marked as unassigned for the ranges, |
| it helps to sort the ranges by byte sequence values.)</li> |
| <li>All valid four-byte codepage code points that do not map to |
| any Unicode code point are of course unassigned. |
| This includes 9012 sequences with a 0x84 lead byte and 9824 with a 0xe3 lead byte, |
| as well as about 0.5 million with lead bytes 0x85..0x8f and 0xe4..0xfe.</li> |
| </ol> |
| For comparison with other data files, it is possible to reformat a mapping file so that it |
| contains only mappings from Unicode to GB or only from GB to Unicode. |
| <a href="gbsingle.c">gbsingle</a> takes input in the simple format of gbkuni30.txt |
| and outputs even simpler format:<br> |
| With no command line argument, it writes<br> |
| <code>unicode ':' gb</code><br> |
| including the fallback mappings from Unicode to the codepage. |
| With a "gb" argument, it writes<br> |
| <code>gb ':' unicode</code><br> |
| including the fallback mappings from the codepage to Unicode.<br> |
| Use <a href="gbbmp.bat">gbbmp.bat</a> to generate such files (and one combined file) |
| for all BMP code points from the data provided here. |
| </p> |
| |
| </body> |
| </html> |