 <html lang="en">
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
 <title>ICU Codepage Conversion</title>
 </head>

 <body>

 <h1>International Components for Unicode</h1>

 <h2>ICU Codepage Conversion</h2>

 <p>The ICU conversion API is a set of C functions used to convert to and from
 Unicode and various character sets (codepages, encodings, character encoding schemes).</p>

 <h3>Conversion-related files:</h3>

 <ul>
     <li>API: The API header files are in icu/source/common/unicode:<br>
         For C, the API is defined in ucnv.h;
         advanced functionality is also defined in ucnv_err.h (callbacks) and
         in ucnv_cb.h (output functions for custom callbacks).<br>
         For C++ the API is defined in convert.h (the C++ class is a wrapper around the C implementation).</li>
     <li>Implementation: The converter implementation files are in icu/source/common;
         all such files begin with "ucnv". The C++ wrapper implementation is in convert.cpp.</li>
     <li>Conversion table generation tool: The makeconv tool that generates binary conversion files
         from text files is in icu/source/tools/makeconv.
         It reads .ucm text files with a format that is close to what the AIX tool uconvdef uses.
         makeconv writes one binary, memory-mappable .cnv file per .ucm file.</li>
     <li>Conversion data: The .ucm text files with the conversion table data are all in
         the icu/data folder. During the build process, makeconv generates binary .cnv files from
         each of them, and the pkgdata tool includes them into the common data file.<br>
         In addition, the file icu/data/convrtrs.txt contains information about "aliases", i.e.,
         alternative names for converters. It is read by gencnval (in icu/source/tools/gencnval)
         which writes the binary file cnvalias.dat that also gets packaged into the common data file.</li>
 </ul>

 <h2>Converter types</h2>

 <p>In order to handle many kinds of character encoding schemes, ICU has a number of
 converter implementations, one per type. Some of these types are for purely algorithmic
 conversions that do not need to load data. For example, the UTF converters calculate
 Unicode code points from the input bytes, and vice versa. Also, the ISO_2022 converter
 does not load any specific conversion data table until it needs one; escape
 sequences and the general structure of ISO 2022 are handled with static data.</p>

 <p>Many other encodings share common characteristics and by definition need tables
 to convert text between them and Unicode. A converter object for such an encoding
 is instantiated by loading a (.cnv) data file (typically from the single, common
 ICU data file) and associating it with a converter type implementation depending
 on the type information in the data.</p>

 <p>The following describes specifics about each converter type:</p>

 <h3>MBCS</h3>

 <p>The MBCS converter is a data-based converter for Multi-Byte Character Sets.
 It was reimplemented for ICU 1.6 to handle a wider range of such encodings.
 Its current capabilities and limitations are:
 <ul>
     <li>Support for variable-length, byte-based encodings with 1 to 4 bytes per character.</li>
     <li>Support for all Unicode characters (code points 0..0x10ffff).
         Since ICU uses UTF-16 as its Unicode encoding form, this means that surrogate
         pairs are fully supported.</li>
     <li>Efficient distinction of unassigned vs. illegal byte sequences.</li>
     <li>It would be possible in fromUnicode() to directly deal with simple
         stateful encodings. (This is currently not used.)</li>
     <li>It is possible to convert Unicode code points other than U+0000
         to a single zero byte (but not as a fallback).</li>
     <li>It is not otherwise possible to convert from Unicode to byte sequences
         with leading zero bytes.</li>
 </ul>
 </p>

 <p>The conversion to Unicode uses a state machine to achieve the above capabilities with
 reasonable data file sizes. The state machine information itself is loaded with the
 conversion data and defines the structure of the codepage, including which byte sequences
 are valid, unassigned, and illegal. This data cannot (or not easily) be computed from
 the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries
 that are specific to ICU's makeconv and this converter type. They are additional header lines
 that start with <code>&lt;icu:state></code>. Each such line defines one state of the state machine.
 The state machine uses a table of as many rows as there are states (= as many as there are
 <code>&lt;icu:state></code> lines). Each row has 256 entries, one for each possible byte value.</p>

 <p>The state table lines in the .ucm header use the following EBNF-like grammar
 (whitespace is allowed between all tokens):
 <pre>
     row=[firstentry ','] entry (',' entry)*
     firstentry="initial" | "surrogates"
                (initial state (default for state 0), output is all surrogate pairs)
 </pre>
 Each state table row description (that follows the <code>&lt;icu:state></code>)
 begins with an optional <code>initial</code> or <code>surrogates</code> keyword
 and is followed by one or more column entries.
 For the purpose of MBCS state tables, the states=rows in the table are numbered
 beginning with 0 at the first such line in the .ucm file header.
 The numbers are assigned implicitly by makeconv in order of the <code>&lt;icu:state></code>
 lines.
 <pre>
     entry=range [':' nextstate] ['.' [action]]
     range=number ['-' number]
     nextstate=number
               (0..7f)
     action='u' | 's' | 'p' | 'i'
            (unassigned, state change only, surrogate pair, illegal)
     number=(1- or 2-digit hexadecimal number)
 </pre>
 Each column entry consists of at least a hexadecimal byte value or value range
 and is separated from the following column entry by a comma.
 The column entry specifies how to interpret an input byte in the row's state.
 If neither a next state nor an action is explicitly specified - only the byte
 value (range) is given - then the byte value terminates the byte sequence,
 results in a valid mapping to a Unicode BMP character, and the state number is
 reset to 0.</p>
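 <p>The entry grammar above can be sketched as a small parser. The following Python
 is a minimal illustration, not makeconv's implementation; the names <code>Entry</code>,
 <code>ENTRY_RE</code>, and <code>parse_entry</code>, and the label <code>"valid"</code>
 for the default action, are assumptions made for this example.</p>

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    first: int                # first byte value of the range
    last: int                 # last byte value (== first for a single byte)
    next_state: int           # explicit next state, or 0 by default
    action: Optional[str]     # "valid", "u", "s", "p", "i"; None = intermediate byte


# entry = range [':' nextstate] ['.' [action]], whitespace allowed between tokens
ENTRY_RE = re.compile(
    r"""\s*([0-9a-fA-F]{1,2})              # range start
        (?:\s*-\s*([0-9a-fA-F]{1,2}))?     # optional range end
        (?:\s*:\s*([0-9a-fA-F]{1,2}))?     # optional next state
        (?:\s*(\.)\s*([uspi])?)?           # optional period and action letter
        \s*""",
    re.VERBOSE,
)


def parse_entry(text: str) -> Entry:
    """Parse one column entry of a state table row."""
    m = ENTRY_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"malformed column entry: {text!r}")
    start, end, state, period, letter = m.groups()
    first = int(start, 16)
    last = int(end, 16) if end is not None else first
    if first > last:
        raise ValueError(f"empty byte range: {text!r}")
    next_state = int(state, 16) if state is not None else 0
    if next_state > 0x7F:
        raise ValueError(f"next state out of range 0..7f: {text!r}")
    if state is not None and period is None:
        action = None                  # intermediate byte: only a state transition
    else:
        action = letter or "valid"     # final byte; no letter means a valid BMP mapping
    return Entry(first, last, next_state, action)
```

 <p>With this sketch, <code>parse_entry("81-9f:1")</code> yields an intermediate entry
 (action <code>None</code>, next state 1), while <code>parse_entry("0-7f")</code> yields
 a final entry with the default "valid" action and next state 0.</p>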

 <p>The next state can be explicitly specified with a separating
 colon (<code>:</code>) followed by the number of the state (=number/index of the row,
 starting at 0). This is mostly used for intermediate byte values, i.e., for
 bytes that are not the last ones in a sequence. The state machine needs to
 proceed to the next state and read another byte. In this case, no other action
 is specified.</p>

 <p>If the byte value(s) terminate(s) a byte sequence, then the byte
 sequence results in the following depending on the action that is announced with
 a period (<code>.</code>) followed by a letter:
 <ul>
     <li><code>u</code> - Unassigned. The byte sequence is valid but does not encode a character.</li>
     <li>(no letter) - valid. If no action letter is specified, then
         the byte sequence is valid and encodes a Unicode character up to
         U+ffff.</li>
     <li><code>p</code> - surrogate Pair. The byte sequence is valid and may result in
         a surrogate pair (two 16-bit UTF-16 code units) for a code point above U+ffff.</li>
     <li><code>i</code> - Illegal. The byte sequence is illegal. This is the default for
         all byte values in a row that are not otherwise specified with
         column entries.</li>
     <li><code>s</code> - State change only. The byte sequence does not encode any character
         but may change the state number. This could be used with simple, stateful
         encodings (using, for example, SI/SO codes),
         but ICU currently does not take advantage of it.</li>
 </ul>
 If an action is specified but no next state, then the next state number defaults to 0.
 In other words, a byte value (range) terminates a sequence if there is an action
 specified for it, or when there is neither an action nor a next state - in this case,
 it defaults to "valid, next state is 0" (equivalent to <code>:0.</code>).</p>

 <p>If a byte value is not specified in any column entry of a row, then it is
 illegal in the current state. If a byte value is specified in more than one column
 entry of the same row, then the last one is used. This makes it possible to specify common
 properties for a wide byte value range followed by a few exceptions and is easier than
 having to specify mutually exclusive ranges, especially if many of them have the
 same properties.</p>
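 <p>The "last entry wins" rule can be illustrated by expanding a row into its 256 cells.
 This is a sketch under assumptions: <code>fill_row</code> and the
 <code>(next_state, action)</code> cell layout are hypothetical and do not reflect the
 binary format that makeconv writes. The data is state 3 of the EUC-JP example from the
 examples section of this document, written once with overrides and once with mutually
 exclusive ranges.</p>

```python
ILLEGAL = (0, "i")   # default cell: illegal byte, back to state 0


def fill_row(entries):
    """Expand (first, last, next_state, action) entries into 256 cells.

    Entries are applied in order, so a later entry overwrites an earlier one
    for any byte value they share. action=None marks an intermediate byte.
    """
    cells = [ILLEGAL] * 256
    for first, last, next_state, action in entries:
        for byte in range(first, last + 1):
            cells[byte] = (next_state, action)
    return cells


# Shorthand form: one broad range followed by a few exceptions.
shorthand = fill_row([
    (0xA1, 0xFE, 1, None),
    (0xA1, 0xA1, 4, None), (0xA3, 0xAF, 4, None), (0xB6, 0xB6, 4, None),
    (0xD6, 0xD6, 4, None), (0xDA, 0xDB, 4, None), (0xED, 0xF2, 4, None),
])

# The same row written with mutually exclusive ranges only.
explicit = fill_row([
    (0xA1, 0xA1, 4, None), (0xA2, 0xA2, 1, None), (0xA3, 0xAF, 4, None),
    (0xB0, 0xB5, 1, None), (0xB6, 0xB6, 4, None), (0xB7, 0xD5, 1, None),
    (0xD6, 0xD6, 4, None), (0xD7, 0xD9, 1, None), (0xDA, 0xDB, 4, None),
    (0xDC, 0xEC, 1, None), (0xED, 0xF2, 4, None), (0xF3, 0xFE, 1, None),
])

print(shorthand == explicit)
```

 <p>Both forms produce the same 256 cells; the shorthand needs seven entries where the
 explicit form needs twelve.</p>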

 <p>The optional keyword at the beginning of a state line has the following effect:
 <ul>
     <li><code>initial</code>: The state machine can start reading byte sequences
         in this state. State 0 is always an initial state. Only initial states can be
         next states for final byte values. In an initial state, the Unicode mappings
         for all final bytes are also stored directly in the state table.</li>
     <li><code>surrogates</code>: All Unicode mappings for final bytes in non-initial
         states are stored in a separate table of 16-bit Unicode (UTF-16) code units.
         Since most legacy codepages map only to Unicode code points up to U+ffff
         (the Basic Multilingual Plane, BMP), the default allocation per mapping
         result is one 16-bit unit. Individual byte values can be specified to map
         to surrogate pairs (= two 16-bit units) with action letter <code>p</code>.
         The <code>surrogates</code> keyword specifies this for the entire state (row).
         Surrogate pair mapping entries can still hold single units depending on the
         actual mapping data, but single-unit mapping entries cannot hold a pair of units.
         Mapping to single-unit entries is the default because the mapping is faster,
         uses half as much memory in the code units table, and is sufficient for most
         legacy codepages.</li>
 </ul>
 </p>

 <p>When converting to Unicode, the state machine starts in state number 0.
 In each iteration, it reads one input (codepage) byte and either just goes to
 the next state as specified, or treats it as a final byte with the specified action
 and an optional non-0 next (initial) state. This means that a state table needs to
 have at least as many state rows as the maximum number of bytes per character,
 which is the maximum length of any byte sequence.</p>
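 <p>The iteration described above can be sketched as a short loop. This is a simplified
 model, not ICU's converter code: the two-state <code>TABLE</code> is a hypothetical
 codepage invented for the example, <code>scan</code> is a hypothetical helper, and
 illegal or incomplete input is merely labeled here rather than being reported through
 callbacks.</p>

```python
ILLEGAL = (0, "i")   # unspecified bytes are illegal; continue in state 0

# Hypothetical two-state table, not a real codepage. Each row maps a byte to
# (next_state, action); action None marks an intermediate (lead) byte.
TABLE = [
    {**{b: (0, "valid") for b in range(0x00, 0x80)},
     **{b: (1, None) for b in range(0x81, 0xA0)}},
    {**{b: (0, "valid") for b in range(0x40, 0x7F)},
     **{b: (0, "u") for b in range(0x80, 0xFD)}},
]


def scan(table, data):
    """Split data into byte sequences and report the action for each one."""
    state, start, results = 0, 0, []
    for pos, byte in enumerate(data):
        # Look up the cell for this byte in the current state's row.
        state, action = table[state].get(byte, ILLEGAL)
        if action is None:
            continue                       # intermediate byte: read another byte
        results.append((data[start:pos + 1], action))
        start = pos + 1                    # the next sequence starts after this byte
    if start != len(data):
        results.append((data[start:], "truncated"))   # input ended mid-sequence
    return results


for sequence, action in scan(TABLE, b"A\x81A\x81\x90\xff\x82"):
    print(sequence.hex(), action)
```

 <p>For this input the loop reports five sequences: <code>41</code> and <code>8141</code>
 as valid, <code>8190</code> as unassigned, <code>ff</code> as illegal, and the trailing
 lead byte <code>82</code> as truncated.</p>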

 <h4>Examples for MBCS state tables</h4>

 <ul>
     <li>US-ASCII:
     <pre>
     0-7f
     </pre>
     This single-row state table describes US-ASCII.
     Byte values from 0 to 0x7f are valid and map to Unicode characters up to U+ffff.
     Byte values from 0x80 to 0xff are illegal.<br>
     &nbsp;</li>
     <li>Shift-JIS:
     <pre>
     0-7f, 81-9f:1, a0-df, e0-fc:1
     40-7e, 80-fc
     </pre>
     This two-row state table describes the structure of Shift-JIS, which encodes some characters
     with one byte each, and others with two bytes each.
     Bytes 0 to 0x7f and 0xa0 to 0xdf are valid single-byte encodings.
     Bytes 0x81 to 0x9f and 0xe0 to 0xfc are lead bytes, i.e., they are followed by one of
     the bytes that are specified as valid in state 1.
     A byte sequence of 0x85 0x61 is valid, while a single byte of 0x80 or 0xff is illegal.
     Similarly, a byte sequence of 0x85 0x31 is illegal.<br>
     &nbsp;</li>
     <li>EUC-JP:
     <pre>
     0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
     a1-fe
     a1-e4
     a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
     a1-fe.u
     </pre>
     This fairly complicated state table describes EUC-JP.
     Valid byte sequences are one, two, or three bytes long.
     Two-byte sequences have lead byte 0x8e and end in state 2, or
     lead bytes 0xa1 to 0xfe and end in state 1.
     Three-byte sequences have a lead byte of 0x8f and continue in state 3.
     Some final byte value ranges are entirely unassigned, therefore they end in state 4
     with an action letter of <code>u</code> for "unassigned" to save significant memory
     for the code units table.
     Assigned three-byte sequences end in state 1 like most two-byte sequences.<br>
     <em>Note: </em>This reuse of a final or intermediate state is valid as long
     as there is no cycle in the state chain. The mappings will be unique because of
     the different path to the shared state.
     (Sharing a state saves some memory: Each state table row occupies 1kB in the .cnv file.)<br>
     This table also shows the redefinition of byte value ranges within one state row
     (number 3) as a shorthand.<br>
     &nbsp;</li>
 </ul>
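 <p>The statements made about the Shift-JIS and EUC-JP tables above can be checked with
 a short script. This is a sketch under assumptions, not ICU code: <code>expand_row</code>,
 <code>classify</code>, and <code>sequence_lengths</code> are hypothetical helpers, the
 <code>initial</code> and <code>surrogates</code> keywords are not handled, and
 unassigned sequences count as structurally legal when computing lengths.</p>

```python
import re

ENTRY_RE = re.compile(
    r"([0-9a-f]{1,2})(?:-([0-9a-f]{1,2}))?(?::([0-9a-f]{1,2}))?(\.[uspi]?)?"
)


def expand_row(row_text):
    """Expand one state row into 256 (next_state, action) cells."""
    cells = [(0, "i")] * 256                       # unspecified bytes are illegal
    for entry in row_text.replace(" ", "").split(","):
        m = ENTRY_RE.fullmatch(entry)
        if m is None:
            raise ValueError(f"malformed column entry: {entry!r}")
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        next_state = int(m.group(3), 16) if m.group(3) else 0
        if m.group(3) and m.group(4) is None:
            action = None                          # intermediate byte
        else:
            action = (m.group(4) or ".")[1:] or "valid"
        for byte in range(first, last + 1):
            cells[byte] = (next_state, action)     # later entries overwrite earlier ones
    return cells


def classify(table, data):
    """Return the action for the first complete byte sequence in data."""
    state = 0
    for byte in data:
        state, action = table[state][byte]
        if action is not None:
            return action
    return "truncated"


def sequence_lengths(table, state=0, depth=1, seen=()):
    """Return the lengths of all legal byte sequences; reject cyclic state chains."""
    if state in seen:
        raise ValueError("cycle in the state chain")
    lengths = set()
    for next_state, action in set(table[state]):
        if action is None:
            lengths |= sequence_lengths(table, next_state, depth + 1, seen + (state,))
        elif action != "i":
            lengths.add(depth)
    return lengths


SHIFT_JIS = [expand_row(r) for r in (
    "0-7f, 81-9f:1, a0-df, e0-fc:1",
    "40-7e, 80-fc",
)]
EUC_JP = [expand_row(r) for r in (
    "0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1",
    "a1-fe",
    "a1-e4",
    "a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4",
    "a1-fe.u",
)]

print(classify(SHIFT_JIS, bytes([0x85, 0x61])), classify(SHIFT_JIS, bytes([0x85, 0x31])))
print(sorted(sequence_lengths(SHIFT_JIS)), sorted(sequence_lengths(EUC_JP)))
```

 <p>In this model, 0x85 0x61 is classified as valid and 0x85 0x31 as illegal for Shift-JIS,
 and the computed sequence lengths are 1 and 2 for Shift-JIS and 1, 2, and 3 for EUC-JP.
 The EUC-JP sequence 0x8f 0xa1 0xa1 passes through states 3 and 4 and is reported as
 unassigned.</p>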

 </body>
 </html>