| # Conversion Data |
| |
| ## Introduction |
| |
| ### Algorithmic vs. Data-based |
| |
| In a comprehensive conversion library, there are three kinds of codepage |
| converter implementations: converters that use algorithms, mapping data, or |
| those converters that use both. |
| |
| 1. Most codepages have a simple and straightforward structure but have an |
| arbitrary relationship between input and output character codes. Mapping |
| tables are necessary to define the conversion. If the codepage characters |
| use more than one byte each, then the mapping table must also define the |
| structure of the codepage. |
| |
| 2. Algorithmic converters work by transforming the input stream with built-in |
| algorithms and possibly small, hard coded tables. The conversion can be |
| complex, but the actual mapping of a character code is done numerically if |
| the converter is purely algorithmic. |
| |
| 3. In some cases, a converter needs to be algorithmic for its basic operations |
| but also relies on mapping data. |
| |
| ICU provides converter implementations for all three groups of codepages. Since |
| ICU always converts, to or from Unicode, the purely algorithmic converters are |
| the ones for Unicode encodings (such as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, |
| UTF-32LE, SCSU, BOCU-1 and UTF-7). Since Unicode is based on US-ASCII and |
| ISO-8859-1 ("ISO Latin-1"), these encodings also use algorithmic converters for |
| performance reasons. |
| |
| Most other codepages use simple byte sequences but are not encodings of Unicode. |
| They are converted with generic code using mapping data tables. ICU also |
| supports a few encodings, like ISO-2022 and its variants, that employ an |
| algorithmic structure to switch between a set of codepages. The converters for |
| these encodings are algorithmic but use mapping tables for the embedded |
| codepages. |
| |
| ### Stateful vs. Stateless |
| |
| Character encodings are either stateful or stateless: |
| |
| 1. Stateless encodings define a byte sequence for each character. Complete |
| character byte sequences can be used in any order, and the same complete |
| character byte sequences always encodes the same characters. It is |
| preferable to always encode one character using the same byte sequence. |
| |
| 2. Stateful encodings define byte sequences that change the state of the text |
| stream. Depending on the current state, the same byte sequence may encode a |
| different character and the same character may be encoded with different |
| byte sequences. |
| |
| This distinction between stateless and stateful encodings is important, because |
| it determines if any available ICU converter implementation is used. The |
| following are some more important considerations related to stateless versus |
| stateful encodings: |
| |
| 1. A runtime converter object is always stateful, even for "stateless" |
| encodings. They are always stateful because an input buffer may end with a |
| partial byte sequence that is to be continued in the next input buffer in |
| the following conversion call. The information about this is stored in the |
| converter object. Similarly, if the input is Unicode text, then an input |
| buffer may end with the first of a pair of surrogates. The converter object |
| also stores overflow bytes or code units if the result of a character |
| mapping did not fit entirely into the output buffer. |
| |
| 2. Stateless encodings are stateful in our converter implementation to |
| interpret "complete byte sequences". They are "stateful" because many |
| encodings can have the same byte value used in different positions of byte |
| sequences for different characters; a specific byte value may be a lead byte |
| or a trail byte. For instance, the lead and trail byte values overlap in |
| codepages like Shift-JIS. If a program does not start reading at a character |
| boundary, it may instead interpret the byte sequences from two or more |
| separate characters as one character. Often, character boundaries can be |
| detected reliably only by reading the non-Unicode text linearly from the |
| beginning. This can be a problem for non-Unicode text processing, where text |
| insertion, deletion, and searching are common. The UTF-8/16/32 encodings do |
| not have this problem because the single, lead, or trail units have disjoint |
| values and character boundary can be easily found. |
| |
| 3. Some stateful encodings only switch between two states: one with one byte |
| per character and one with two bytes per character. This type of encoding is |
| very common in mainframe systems based on Extended Binary Coded Decimal |
| Interchange Code (EBCDIC) and is actually handled in ICU with almost the |
| same code and type of mapping tables as stateless codepages. |
| |
| 4. The classifications of algorithmic vs. data-based converters and of |
| stateless vs. stateful encodings are independent of each other: UTF-8, |
| UTF-16, and UTF-32 encodings are algorithmic but stateless; UTF-7 and SCSU |
| encodings are algorithmic and stateful; Windows-1252 and Shift-JIS encodings |
| are data-based and stateless; ISO-2022-JP encoding is algorithmic, |
| data-based, and stateful. |
| |
| ### Scope of this chapter |
| |
| The following sections in this chapter discuss the mapping data tables that are |
| used in ICU. For related material, please see: |
| |
| 1. [ICU character set collection](http://icu-project.org/charts/charset/) |
| |
| 2. [Unicode Technical Report 22](http://www.unicode.org/unicode/reports/tr22/) |
| |
| 3. "Cross Mapping Tables" in [Unicode Online |
| Data](http://www.unicode.org/unicode/onlinedat/online.html) |
| |
| ## ICU Mapping Table Data Files |
| |
| ### Overview |
| |
| As stated above, most ICU converters rely on character mapping tables. ICU 1.8 |
| has one single data structure for all character mapping tables, which is used by |
| a generic Multi-Byte Character Set (MBCS) converter implementation. The |
| implementation is flexible enough to handle stateless encodings with the |
| following parameters: |
| |
| 1. Support for variable-length, byte-based encodings with 1 to 4 bytes per |
| character. |
| |
| 2. Support for all Unicode characters (code points 0..0x10ffff). Since ICU 1.8 |
| uses the UTF-16 encoding as its Unicode encoding form, surrogate pairs are |
| completely supported. |
| |
| 3. Efficient distinction between unassigned (unmappable) and illegal byte |
| sequences. |
| |
| 4. It is not possible to convert from Unicode to byte sequences with leading |
| zero bytes. |
| |
| 5. Simple stateful encodings are also handled using only Shift-In and Shift-Out |
| (SI/SO) codes and one single-byte and one double-byte state. |
| |
| *In the context of conversion tables, "unassigned" code points or codepage byte |
| sequences are valid but do not have a **mapping**. This is different from |
| "unassigned" code points in a character set like Unicode or Shift-JIS which are |
| codes that do not have assigned **characters**.* |
| |
| Prior to version 1.8, ICU used more specific, more limited, converter |
| implementations for Single Byte Character Set (SBCS), Double Byte Character Set |
| (DBCS), and the stateful Extended Binary Coded Decimal Interchange Code (EBCDIC) |
| codepages. Mapping table data is provided in text files. ICU comes with several |
| dozen .ucm files (UniCode Mapping, in icu/source/data/mappings/) that are |
| translated at build time by its makeconv tool (source code in |
| icu/source/tools/makeconv). The makeconv tool writes one binary, memory-mappable |
| .cnv file per .ucm file. The resulting .cnv files are included by default in the |
| common data file for use at runtime. |
| |
| The format of the .ucm files is similar to the format of the UPMAP files as |
| provided by IBM® in the codepage repository and as used in the uconvdef tool on |
| AIX. UPMAP is a text file that specifies the mapping of a codepage character to |
| and from Unicode. |
| |
| The format of the .cnv files is ICU-specific. The .cnv file format may change |
| between ICU versions even for the same .ucm files. The .ucm file format may be |
| extended to include more features. |
| |
| The following sections concentrate on the .ucm file format. The .cnv file format |
| is described in the source code in the icu/source/common/ucnvmbcs.c directory |
| and is updated using the MBCS converter implementation. |
| |
| These conversion tables can have more than one name. ICU allows multiple names |
| ("aliases") for the same encoding. It matches a requested encoding name against |
| a list of names in icu/source/data/mappings/convrtrs.txt and when it finds a |
| match, ICU opens a converter with the name in the leftmost position in the |
| matching line. The name matching is not case-sensitive and ICU ignores spaces, |
| dashes, and underscores. At build time, the gencnval tool located in the |
| icu/source/tools/gencnval directory, generates a binary form of the convrtrs.txt |
| file as a data file for runtime for the cnvalias.icu file ("Converter Aliases |
| data file"). |
| |
| ### .ucm File Format |
| |
| .ucm files are line-oriented text files. Empty lines and comments starting with |
| '#' are ignored. |
| |
| A .ucm file contains two sections: |
| |
| 1. a header with general specifications of the codepage |
| |
| 2. a mapping table section between the "CHARMAP" and "END CHARMAP" lines. |
| |
| For example: |
| |
| <code_set_name> "IBM-943" |
| <char_name_mask> "AXXXX" |
| <mb_cur_min> 1 |
| <mb_cur_max> 2 |
| <uconv_class> "MBCS" |
| <subchar> \\xFC\\xFC |
| <subchar1> \\x7F |
| <icu:state> 0-7f, 81-9f:1, a0-df, e0-fc:1 |
| <icu:state> 40-7e, 80-fc |
| # |
| CHARMAP |
| # |
| # |
| #ISO 10646 IBM-943 |
| #_________ _________ |
| <U0000> \\x00 |0 |
| <U0001> \\x01 |0 |
| <U0002> \\x02 |0 |
| <U0003> \\x03 |0 |
| ... |
| <UFFE4> \\xFA\\x55 |1 |
| <UFFE5> \\x81\\x8F |0 |
| <UFFFD> \\xFC\\xFC |2 |
| END CHARMAP |
| |
| The header fields are: |
| |
| 1. code_set_name - The name of the codepage. The makeconv tool generates the |
| .cnv file name from the .ucm filename but uses this header field for the |
| converter name that it writes into the .cnv file for ucnv_getName. The |
| makeconv tool prints a warning message if this header field does not match |
| the file name. The file name is not case-sensitive. |
| |
| 2. char_name_mask - This is ignored by makeconv tool. "AXXXX" specifies that |
| the POSIX-style character "name" consists of one letter (Alpha) followed by |
| 4 hexadecimal digits. Since ICU only uses Unicode character "names" (for |
| example, code points) the format is fixed (see below). |
| |
| 3. mb_cur_min - The minimum number of bytes per character. |
| |
| 4. mb_cur_max - The maximum number of bytes per character. |
| |
| 5. uconv_class - This can be either "SBCS", "DBCS", "MBCS", or |
| "EBCDIC_STATEFUL" |
| The most general converter class/type/category is MBCS, which requires that |
| the codepage structure has the following <icu:state> lines. The other types |
| of converters are subsets of MBCS. The makeconv tool uses predefined state |
| tables for these other converters when their structure is not explicitly |
| specified. The following describes how the converter types are interpreted: |
| |
| 1. MBCS: Generic ICU converter type, requires a state table |
| |
| 2. SBCS: Single-byte, 8-bit codepages |
| |
| 3. DBCS: Double-byte EBCDIC codepages |
| |
| 4. EBCDIC_STATEFUL: Mixed Single-Byte or Double-Byte EBCDIC codepages |
| (stateful, using SI/SO) |
| |
| The following shows the exact implied state tables for non-MBCS types. A state |
| table may need to be overwritten in order to allow supplementary characters |
| (U+10000 and up). |
| |
| 1. subchar - The substitution character byte sequence for this codepage. This |
| sequence must be a valid byte sequence according to the codepage structure. |
| |
| 2. subchar1 - This is the single byte substitution character when subchar is |
| defined. Some IBM converter libraries use different substitution characters |
| for "narrow" and "wide" characters (single-byte and double-byte). ICU uses |
| only one substitution character per codepage because it is common industry |
| practice. |
| |
| 3. icu:state - See the "State Table Syntax in .ucm Files" section for a |
| detailed description of how to specify a codepage structure. |
| |
| 4. icu:charsetFamily - This specifies if the codepage is ASCII or EBCDIC based. |
| |
| The subchar and subchar1 fields have been known to cause some confusion. The |
| following conditions outline when each are used: |
| |
| 1. Conversion from Unicode to a codepage occurs and an unassigned code point is |
| found |
| |
| 1. If a subchar1 byte is defined and a subchar1 mapping is defined for the |
| code point (with a |2 precision indicator), output the subchar1 |
| |
| 2. Otherwise output the regular subchar |
| |
| 2. Conversion from a codepage to Unicode occurs and an unassigned codepoint is |
| found |
| |
| 1. If the input sequence is of length 1 and a subchar1 byte is specified |
| for the codepage, output U+001A |
| |
| 2. Otherwise output U+FFFD |
| |
| In the CHARMAP section of a .ucm file, each line contains a Unicode code point |
| (like <U*(1-6 hexadecimal digits for the code point)*> ), a codepage character |
| byte sequence (each byte like \\x*hh* (2 hexadecimal digits} ), and an optional |
| "precision" or "fallback" indicator. |
| |
| The precision indicator either must be present in all mappings or in none of |
| them. The indicator is a pipe symbol ‘|’ followed by a 0, 1, 2, 3, or 4 that has |
| the following meaning: |
| |
| * |0 - A "normal", roundtrip mapping from a Unicode code point and back. |
| * |1 - A "fallback" mapping only from Unicode to the codepage, but not back. |
| * |2 – A subchar1 mapping. The code point is unmappable, and if a substitution |
| is performed, then the subchar1 should be used rather than the subchar. |
| Otherwise, such mappings are ignored. |
| * |3 - A "reverse fallback" mapping only from the codepage to Unicode, but not |
| back to the codepage. |
| * |4 - A "good one-way" mapping only from Unicode to the codepage, but not |
| back. |
| |
| Fallback mappings from Unicode typically do not map codes for the same |
| character, but for "similar" ones. This mapping is sometimes done if a character |
| exists in Unicode but not in the codepage. To replace it, ICU maps a codepage |
| code to a similar-looking code for human-readable output. This mapping feature |
| is not useful for text data transmission especially in markup languages where a |
| Unicode code point can be escaped with its code point value. The ICU application |
| programming interface (API) ucnv_setFallback() controls this fallback behavior. |
| |
| "Reverse fallbacks" are technically similar, but the same Unicode character can |
| be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime. |
| |
| A subset of the fallback mappings from Unicode is always used at runtime: Those |
| that map private-use Unicode code points. Fallbacks from private-use code points |
| are often introduced as replacements for previous roundtrip mappings for the |
| same pair of codes. These replacements are used when a Unicode version assigns a |
| new character that was previously mapped to that private-use code point. The |
| mapping table is then changed to map the same codepage byte sequence to the new |
| Unicode code point (as a new roundtrip) and the mapping from the old private-use |
| code point to the same codepage code is preserved as a fallback. |
| |
| A "good one-way" mapping is like a fallback, but ICU always uses "good one-way" |
| mappings at runtime, regardless of the fallback API flag. |
| |
| The idea is that fallbacks normally lose information, such as mapping from a |
| compatibility variant of a letter to the ASCII version; however, fallbacks from |
| PUA and reverse fallbacks are assumed to be for "the same character", just an |
| older code for it. |
| |
| Something similar happens with from-Unicode Variation Selector sequences. It is |
| possible to round-trip (|0) either the unadorned character or the sequence with |
| a variation selector, and add a "good one-way" mapping (|4) from the other |
| version. That "good one-way" mapping does not lose much information, and it is |
| used even if the "use fallback" API flag is false. Alternatively, both mappings |
| could be fallbacks (|1) that should be controlled by the "use fallback" |
| attribute. |
| |
| ### State table syntax in .ucm files |
| |
| The conversion to Unicode uses a state machine to achieve the above capabilities |
| with reasonable data file sizes. The state machine information itself is loaded |
| with the conversion data and defines the structure of the codepage, including |
| which byte sequences are valid, unassigned, and illegal. This data cannot (or |
| not easily) be computed from the pure mapping data. Instead, the .ucm files for |
| MBCS encodings have additional entries that are specific to the ICU makeconv |
| tool. The state tables for SBCS, DBCS, and EBCDIC_STATEFUL are implied, but they |
| can be overridden (see the examples below). These state tables are specified in |
| the header section of the .ucm file that contains the <icu:state> element. Each |
| line defines one aspect of the state machine. The state machine uses a table of |
| as many rows as there are states (= as many as there are <icu:state> lines). |
| Each row has 256 entries; one for each possible byte value. |
| |
| The state table lines in the .ucm header conform to the following Extended |
| Backus-Naur Form (EBNF)-like grammar (whitespace is allowed between all tokens): |
| |
| row=\[\[firstentry ','\] entry (',' entry)\*\] |
| firstentry="initial" | "surrogates" |
| (initial state (default for state 0), output is all surrogate pairs) |
| |
| Each state table row description (that follows the <icu:state>) begins with an |
| optional initial or surrogates keyword and is followed by one or more column |
| entries. For the purpose of codepage state tables, the states=rows in the table |
| are numbered beginning at 0 for the first line in the .ucm file header. The |
| numbers are assigned implicitly by the makeconv tool in order of the <icu:state> |
| lines. |
| |
| A row may be empty (nothing following the <icu:state>) — that is equivalent to |
| "all illegal" or 0-ff.i and is useful for trail byte states for all-illegal byte |
| sequences. |
| |
| entry=range \[':' nextstate\] \['.' \[action\]\] |
| range = number \['-' number\] |
| nextstate = number (0..7f) |
| action = 'u' | 's' | 'p' | 'i' |
| (unassigned, state change only, surrogate pair, illegal) |
| number = (1- or 2-digit hexadecimal number) |
| |
| Each column entry contains at least one hexadecimal byte value or value range |
| and is separated by a comma. The column entry specifies how to interpret an |
| input byte in the row's state. If neither a next state nor an action is |
| explicitly specified (only the byte range is given) then the byte value |
| terminates the byte sequence, results in a valid mapping to a Unicode BMP |
| character, and resets the state number to 0. The first line with <icu:state> is |
| called state 0. |
| |
| The next state can be explicitly specified with a separating colon ( : ) |
| followed by the number of the state (=number/index of the row, starting at 0). |
| This specification is mostly used for intermediate byte values (such as bytes |
| that are not the last ones in a sequence). The state machine needs to proceed to |
| the next state and read another byte. In this case, no other action is |
| specified. |
| |
| If the byte value(s) terminate(s) a byte sequence, then the byte sequence |
| results in the following depending on the action that is announced with a period |
| ( . ) followed by a letter: |
| |
| letter meaning u |
| |
| Unassigned. The byte sequence is valid but does not encode a character. |
| |
| none |
| |
| (no letter) - Valid. If no action letter is specified, then the byte sequence is |
| valid and encodes a Unicode character up to U+ffff |
| |
| p |
| |
| Surrogate Pair. The byte sequence is valid and the result may map to a UTF-16 |
| encoded surrogate pair |
| |
| i |
| |
| Illegal. The byte sequence is illegal. This is the default for all byte values |
| in a row that are not otherwise specified with column entries |
| |
| s |
| |
| State change only. The byte sequence does not encode any character but may |
| change the state number. This may be used with simple, stateful encodings (for |
| example, SI/SO codes), but currently it is not used by ICU. |
| |
| If an action is specified without a next state, then the next state number |
| defaults to 0. In other words, a byte value (range) terminates a sequence if |
| there is an action specified for it, or when there is neither an action nor a |
| next state. In this case, the byte value defaults to "valid, next state is 0" |
| (equivalent to :0.). |
| |
| If a byte value is not specified in any column entry row, then it is illegal in |
| the current state. If a byte value is specified in more than one column entry of |
| the same row, then ICU uses the last state. These specifications allow you to |
| assign common properties for a wide byte value range followed by a few |
| exceptions. This is easier than having to specify mutually exclusive ranges, |
| especially if many of them have the same properties. |
| |
| The optional keyword at the beginning of a state line has the following effect: |
| |
| keyword effect initial The state machine can start reading byte sequences in |
| this state. State 0 is always an initial state. Only initial states can be next |
| states for final byte values. In an initial state, the Unicode mappings for all |
| final bytes are also stored directly in the state table. surrogates All Unicode |
| mappings for final bytes in non-initial states are stored in a separate table of |
| 16-bit Unicode (UTF-16) code units. Since most legacy codepages map only to |
| Unicode code points up to U+ffff (the Basic Multilingual Plane, BMP), the |
| default allocation per mapping result is one 16-bit unit. Individual byte values |
| can be specified to map to surrogate pairs (= two 16-bit units) with action |
| letter p. The surrogates keyword specifies the values for the entire state |
| (row). Surrogate pair mapping entries can still hold single units depending on |
| the actual mapping data, but single-unit mapping entries cannot hold a pair of |
| units. Mapping to single-unit entries is the default because the mapping is |
| faster, uses half as much memory in the code units table, and is sufficient for |
| most legacy codepages. |
| |
| When converting to Unicode, the state machine starts in state number 0. In each |
| iteration, the state machine reads one input (codepage) byte and either proceeds |
| to the next state as specified, or treats it as a final byte with the specified |
| action and an optional non-0 next (initial) state. This means that a state table |
| needs to have at least as many state rows as the maximum number of bytes per |
| character, which is the maximum length of any byte sequence. |
| |
| Exception: For EBCDIC_STATEFUL codepages, double-byte sequences start in state |
| 1, with the SI/SO bytes switching from state 0 to state 1 or from state 1 to |
| state 0. See the default state table below. |
| |
| ### Extension and delta tables |
| |
| ICU 2.8 adds an additional "extension" data structure to its conversion tables. |
| The new data structure supports a number of new features. When any of the |
| following features are used, then all mappings must use a precision indicator. |
| |
| #### Converting multiple characters as a unit |
| |
| Before ICU 2.8, only one Unicode code point could be converted to or from one |
| complete codepage byte sequence. The new data structure supports the conversion |
| between multiple Unicode code points and multiple complete codepage byte |
| sequences. (A "complete codepage byte sequence" is a sequence of bytes which is |
| valid according to the state table.) |
| |
| Syntax: Simply write more than one Unicode code point on a mapping line, and/or |
| more than one complete codepage byte sequence. Plus signs (+) are optional |
| between code points and between bytes. For example, |
| ibm-1390_P110-2003.ucm contains |
| |
| <U304B><U309A> \\xEC\\xB5 |0 |
| |
| and test3.ucm contains |
| |
| <U101234>+<U50005>+<U60006> \\x07+\\x00+\\x01\\x02\\x0f+\\x09 |0 |
| |
| For more examples see the ICU conversion data and the |
| icu/source/test/testdata/test\*.ucm test data files. |
| |
| ICU 2.8 supports up to 19 UChars on the Unicode side of a mapping and up to 31 |
| bytes on the codepage side. |
| |
| The longest match possible is converted in order to properly handle tables where |
| the source sides of some mappings are prefixes of the source sides of other |
| mappings. |
| |
| As a side effect, if conversion offsets are written and a potential match |
| crosses buffer boundaries, then some of the initial offsets for the following |
| output may be unknown (-1) because their input was stored in the converter from |
| a previous buffer while looking for a longer match. |
| |
| Conversion tables for SI/SO-stateful (usually EBCDIC_STATEFUL) codepages cannot |
| include mappings with SI or SO bytes or where there are SBCS characters in a |
| multi-character byte sequence. In other words, for these tables there must be |
| exactly one byte in a mapping or else a sequence of one or more DBCS characters. |
| |
| #### Delta (extension-only) conversion table files |
| |
| Physically, a binary conversion table (.cnv) file automatically contains both a |
| traditional "base table" data structure for the 1:1 mappings and a new |
| "extension table" for the m:n mappings if any are encountered in the .ucm file. |
| An extension table can also be requested manually by splitting the CHARMAP into |
| two. The first CHARMAP section will be used for the base table, and the second |
| only for the extension table. M:n mappings in the first CHARMAP will be moved to |
| the extension table. |
| |
| In order to save space for very similar conversion tables, it is possible to |
| create delta .cnv files that contain only an extension table and the name of |
| another .cnv file with a base table. The base file must be split into two |
| CHARMAPs such that the base file's base table does not contain any mappings that |
| contradict any of the delta file's mappings. |
| |
| The delta (extension-only) file uses only a single CHARMAP section. In addition, |
| it nees a line in the header that both causes building just a delta file and |
| specifies the name of the base file. For example, windows-936-2000.ucm contains |
| |
| <icu:base> “ibm-1386_P100-2002” |
| |
| makeconv ignores all mappings for the delta file that are also in the base |
| file's base table. If the two conversion tables are sufficiently similar, then |
| the delta file will contain only a relatively small set of mappings, which |
| results in a small .cnv file. At runtime, both the delta file and its base file |
| are loaded, and the base file's base table is used together with the extension |
| file. The base file works as a standalone file, using its own extension table |
| for its full set of mappings. The base file must be in the same ICU data package |
| as the delta file. |
| |
| The hard part is to split the base file's mappings into base and extension |
| CHARMAPs such that the base table does not overlap with any delta file, while |
| all shared mappings should be in the base table. (The base table data structure |
| is more compact than the extension table data structure.) |
| |
| ICU provides the ucmkbase tool in the |
| [ucmtools](http://source.icu-project.org/repos/icu/data/trunk/charset/source/ucmtools/) |
| collection to do this. |
| |
| For example, the following illustrates how to use ucmkbase to make a base .ucm |
| file for three Shift-JIS conversion table variants. (ibm-943_P15A-2003.ucm |
| becomes the base.) |
| |
| C:\\tmp\\icu\\ucm>ren ibm-943_P15A-2003.ucm ibm-943_P15A-2003.orig |
| C:\\tmp\\icu\\ucm>ucmkbase ibm-943_P15A-2003.orig ibm-943_P130-1999.ucm |
| ibm-942_P12A-1999.ucm > ibm-943_P15A-2003.ucm |
| |
| After this, the two delta .ucm files only need to get the following line added |
| before the start of their CHARMAPs: |
| |
| <icu:base> "ibm-943_P15A-2003" |
| |
| The ICU tools and runtime code handle DBCS-only conversion tables specially, |
| allowing them to be built into delta files with MBCS or EBCDIC_STATEFUL base |
| files without using their single-byte mappings, and without ucmkbase moving the |
| single-byte mappings of the base file into the base file's extension table. See |
| for example ibm-16684_P110-2003.ucm and ibm-1390_P110-2003.ucm. |
| |
| #### Other enhancements |
| |
| ICU 2.8 adds support for the specification of which unassigned Unicode code |
| points should be mapped to subchar1 rather than the default subchar. See the |
| discussion of subchar1 above for more details. |
| |
| The extension table data structure also removes one minor limitation on ICU |
| conversion tables: Fallback mappings to a single byte 00 are now allowed and |
| handled properly. ICU versions before 2.8 could only handle roundtrips to/from |
| 00. |
| |
| ### Examples for codepage state tables |
| |
| The following shows the exact implied state tables for non-MBCS types, A state |
| table may need to be overwritten in order to allow supplementary characters |
| (U+10000 and up). |
| |
| US-ASCII |
| |
| 0-7f |
| |
| This single-row state table describes US-ASCII. Byte values from 0 to 0x7f are |
| valid and map to Unicode characters up to U+ffff. Byte values from 0x80 to 0xff |
| are illegal. |
| |
| Shift-JIS |
| |
| 0-7f, 81-9f:1, a0-df, e0-fc:1 |
| 40-7e, 80-fc |
| |
| This two-row state table describes the Shift-JIS structure which encodes some |
| characters with one byte each and others with two bytes each. Bytes 0 to 0x7f |
| and 0xa0 to 0xdf are valid single-byte encodings. Bytes 0x81 to 0x9f and 0xe0 to |
| 0xfc are lead bytes. (For example, they are followed by one of the bytes that is |
| specified as valid in state 1). A byte sequence of 0x85 0x61 is valid while a |
| single byte of 0x80 or 0xff is illegal. Similarly, a byte sequence of 0x85 0x31 |
| is illegal. |
| |
| EUC-JP |
| |
| 0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1 |
| a1-fe |
| a1-e4 |
| a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4 |
| a1-fe.u |
| |
| This fairly complicated state table describes EUC-JP. Valid byte sequences are |
| one, two, or three bytes long. Two-byte sequences have a lead byte of 0x8e and |
| end in state 2, or have lead bytes 0xa1 to 0xfe and end in state 1. Three-byte |
| sequences have a lead byte of 0x8f and continue in state 3. Some final byte |
| value ranges are entirely unassigned, therefore they end in state 4 with an |
| action letter of u for "unassigned" to save significant memory for the code |
| units table. Assigned three-byte sequences end in state 1 like most two-byte |
| sequences. |
| |
| SBCS default state table: |
| |
| 0-ff |
| |
| SBCS by default implies the structure for single-byte, 8-bit codepages. |
| |
| DBCS default state table: |
| |
| 0-3f:3, 40:2, 41-fe:1, ff:3 |
| 41-fe |
| 40 |
| |
| Important: |
| |
| These are four states — the fourth has an empty line (equivalent to 0-ff.i)! |
| DBCS codepages, by default, are defined with the EBCDIC double-byte structure. |
| Valid sequences are pairs of bytes from 0x41 to 0xfe and the one pair 0x40/0x40 |
| for the double-byte space. The structure is defined such that all illegal byte |
| sequences are always two in length. Therefore, every byte in the initial state |
| is a lead byte. |
| |
| EBCDIC_STATEFUL default state table: |
| |
| 0-ff, e:1.s, f:0.s |
| initial, 0-3f:4, e:1.s, f:0.s, 40:3, 41-fe:2, ff:4 |
| 0-40:1.i, 41-fe:1., ff:1.i |
| 0-ff:1.i, 40:1. |
| 0-ff:1.i |
| |
| This is the structure of Mixed Single-byte and Double-byte EBCDIC codepages, |
| which are stateful and use the Shift-In/Shift-Out (SI/SO) bytes 0x0f/0x0e. The |
| initial state 0 is almost the same as for SBCS except for SI and SO. State 1 is |
| also an initial state and is the basis for a state-shifted version of the DBCS |
| structure above. All double-byte sequences return to state 1 and SI switches |
| back to state 0. SI and SO are also allowed in their own states with no effect. |
| |
| *If a DBCS or EBCDIC_STATEFUL codepage maps supplementary (non-BMP) Unicode |
| characters, then a modified state table needs to be specified in the .ucm file. |
| The state table needs to use the surrogates designation for a table row or .p |
| for some entries.* |
| *The reuse of a final or intermediate state (shown for EUC-JP) is valid for as |
| long as there is no circle in the state chain. The mappings will be unique |
| because of the different path to the shared state (sharing a state saves some |
| memory; each state table row occupies 1kB in the .cnv file). This table also |
| shows the redefinition of byte value ranges within one state row (State number |
| 3)as shorthand. State 3 defines bytes a1-fe to go to state 1, but the following |
| entries redefine and override certain bytes to go to state 4.* |
| |
| An initial state never needs a surrogates designation or .p because Unicode |
| mapping results in initial states that are stored directly in the state table, |
| providing enough room in each cell. The size of a generated .cnv mapping table |
| file depends primarily on the number and distribution of the mappings and on the |
| number of valid, multi-byte sequences that the state table allows. Each state |
| table row takes up one kilobyte. |
| |
| For single-byte codepages, the state table cells contain all two-Unicode |
| mappings. Code point results for multi-byte sequences are stored in an array |
| with enough room for all valid byte sequences. For all byte sequences that end |
| in a surrogates or .p state, Unicode allocates two code units. |
| |
| If possible, valid state table entries may be changed to .u to reduce the number |
| of valid, assignable sequences and to make the .cnv file smaller. If additional |
| states are necessary, then each additional state itself adds 1kB to the file |
| size, diminishing the file size savings. See the EUC-JP example above. |
| |
| For codepages with up to two bytes per character, the makeconv tool |
| automatically compacts the bytes, if possible, by introducing one more trail |
| byte state. This state replaces valid entries in the original trail state with |
| unassigned entries and changes each lead byte entry to work with the new state |
| if there are no mappings with that lead byte. |
| |
| For codepages with up to three or four bytes per character, compaction must be |
| done manually. However, if the verbose option is set on the command line, the |
| makeconv tool will print useful information about unassigned byte sequences. |