 ---
 layout: default
 title: Conversion
 nav_order: 4
 has_children: true
 ---
 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # Conversion
 {: .no_toc }

 ## Contents
 {: .no_toc .text-delta }

 1. TOC
 {:toc}

 ---

 ## Conversion Overview

 A converter converts text from one character encoding to another. In ICU,
 conversion is always between Unicode and another encoding, in either
 direction. A text encoding is a particular mapping from a given character set
 definition to the actual bits used to represent the data.

 Unicode provides a single character set that covers the major languages of the
 world, and a small number of machine-friendly encoding forms and schemes to fit
 the needs of existing applications and protocols. It is designed for best
 interoperability with both ASCII and ISO-8859-1 (the most widely used character
 sets) to make it easier for Unicode to be used in almost all applications and
 protocols.

 Hundreds of encodings have been developed over the years, each for a small
 group of languages or for a special purpose. As a result, interpreting,
 entering, sorting, displaying, and storing text all depend on knowledge of the
 different character sets and their encodings. Programs have been written
 either to handle one encoding at a time and switch between them, or to convert
 between external and internal encodings.

 For many encodings, there is no single, authoritative source of precise
 definitions or names. However,
 [IANA](http://www.iana.org/assignments/character-sets) is the best source for
 names, and our Character Set repository is a good source of encoding definitions
 for each platform.

 Transferring text from one machine to another often causes some loss of
 information, because platforms can interpret the same text differently. For
 example, Shift-JIS is interpreted differently on Windows™ than on UNIX®:
 Windows maps byte value 0x5C to the backslash symbol, while some UNIX machines
 map that byte value to the Yen symbol. Another problem arises when a character
 in the codepage looks like both the Unicode Greek letter Mu and the Unicode
 micro symbol. Some platforms map this codepage byte sequence to one Unicode
 character, while other platforms map it to the other. Fallbacks can partially
 fix this problem by mapping both Unicode characters to the same codepage byte
 sequence. Even though some character information is lost, the text is still
 readable.
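
The Mu/micro case can be pictured with a toy mapping table. The following is a minimal C sketch, not ICU's implementation: the `ToyMapping` type, the `TOY_TABLE` data, the `toy_from_unicode` function, and the choice of byte 0xE6 are all hypothetical and exist only for illustration. One Unicode character round-trips with the codepage byte, and the look-alike character is added as a one-way fallback to the same byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a toy Unicode-to-codepage table (hypothetical, not ICU data). */
typedef struct {
    uint32_t code_point;  /* Unicode code point */
    uint8_t  byte;        /* codepage byte it maps to */
    bool     is_fallback; /* true: one-way mapping that does not round-trip */
} ToyMapping;

/* Assumed codepage in which byte 0xE6 round-trips with U+00B5 MICRO SIGN.
 * U+03BC GREEK SMALL LETTER MU is a fallback to the same byte. */
static const ToyMapping TOY_TABLE[] = {
    { 0x00B5, 0xE6, false },
    { 0x03BC, 0xE6, true  },
};

/* Returns the codepage byte for code_point, or -1 if it is unmappable.
 * Fallback rows are considered only when allow_fallback is true. */
static int toy_from_unicode(uint32_t code_point, bool allow_fallback)
{
    for (size_t i = 0; i < sizeof TOY_TABLE / sizeof TOY_TABLE[0]; ++i) {
        const ToyMapping *m = &TOY_TABLE[i];
        if (m->code_point == code_point &&
            (allow_fallback || !m->is_fallback)) {
            return m->byte;
        }
    }
    return -1;
}
```

With fallbacks allowed, both U+00B5 and U+03BC produce the same byte, so the text stays readable but the distinction between the two characters is lost. With fallbacks disallowed, U+03BC is reported as unmappable instead.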

 ICU's converter API has the following main features:

 1.  Unicode surrogate support

 2.  Support for all major encodings

 3.  Consistent text conversion across all computer platforms

 4.  Streaming (buffering) of text data through the API

 5.  Fast text conversion

 6.  Fallbacks to the codepage

 7.  Reverse fallbacks to Unicode

 8.  Callbacks for handling and substituting invalid or unmapped byte
     sequences

 9.  User-added support for otherwise unsupported encodings

 This section describes how to convert text between Unicode and other
 encodings.

 ## Recommendations

 1.  **Use Unicode encodings whenever possible.** Combined with Unicode for
     internal processing, this makes completely globalized systems possible and
     avoids the many problems with non-algorithmic conversions. (For a discussion
     of such problems, see, for example, ["Character Conversions and Mapping
     Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
     on <http://icu-project.org/docs/> and the [XML Japanese
     Profile](http://www.w3.org/TR/japanese-xml/).)

     1.  Use UTF-8 and UTF-16.

     2.  Use UTF-16BE, SCSU and BOCU-1 as appropriate.

     3.  In special environments, other Unicode encodings may be used as well,
         such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and
         CESU-8. (For turning Unicode filenames into ASCII-only filename strings,
         the IMAP-mailbox-name encoding can be used.)

     4.  Do not exchange text with single/unpaired surrogates.

 2.  **Use legacy charsets only when absolutely necessary.** For best data
     fidelity:

     1.  ISO-8859-1 is relatively unproblematic — if its limited character
         repertoire is sufficient — because it is converted trivially (1:1) to
         Unicode, avoiding conversion table problems for its small set of
         characters. (By contrast, proper conversion from US-ASCII requires a
         check for illegal byte values 0x80..0xff, which is an unnecessary
         complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly
         as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)

     2.  If you need to communicate with a certain platform, then use the same
         conversion tables as that platform itself, or at least ones that are
         very, very close.

     3.  ICU's conversion table repository contains hundreds of Unicode
         conversion tables from a number of common vendors and platforms as well
         as comparisons between these conversion tables:
         <http://icu-project.org/charts/charset/>.

     4.  Do not trust codepage documentation that is not machine-readable, such
         as nice-looking charts. Such documentation is usually incomplete and
         out of date.

     5.  ICU's default build includes about 200 conversion tables. See the [ICU
         Data](../icudata.md) chapter for how to add or remove conversion tables
         and other data.

     6.  In ICU, you can (and should) also use APIs that map a charset name
         together with a standard/platform name. This allows you to get different
         converters for the same ambiguous charset name (like "Shift-JIS"),
         depending on the standard or platform specified. See the
         [convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt)
         alias table, the [Using Converters](converters.md) chapter, and the [API
         references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

     7.  For data exchange (rather than pure display), turn off fallback
         mappings with `ucnv_setFallback(cnv, FALSE)`.

     8.  For some text formats, especially XML and HTML, you can set an
         "escape callback" function that turns unmappable Unicode code points
         into corresponding escape sequences, preventing data loss. See the API
         references and the [ucnv sample
         code](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/ucnv/).

     9.  **Never modify a conversion table.** Instead, use existing ones that
         match precisely those in systems with which you communicate. "Modifying"
         a conversion table in reality just creates a new one, which makes the
         whole situation even less manageable.
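
The contrast drawn above between ISO-8859-1 and US-ASCII can be made concrete. The following is a minimal C sketch under stated assumptions, not ICU code: the function names `latin1_to_utf16` and `ascii_to_utf16` are hypothetical, and the caller is assumed to supply a destination buffer with room for at least `length` UTF-16 code units. It shows that ISO-8859-1 needs no table and no validity check, whereas strict US-ASCII conversion has to stop at any byte in the range 0x80..0xff.

```c
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1 to UTF-16: each byte value is the Unicode code point itself,
 * so every input byte is valid and is copied 1:1. */
static void latin1_to_utf16(const uint8_t *src, size_t length, uint16_t *dst)
{
    for (size_t i = 0; i < length; ++i) {
        dst[i] = src[i];
    }
}

/* US-ASCII to UTF-16: only bytes 0x00..0x7f are legal. Converts until the
 * first illegal byte and returns the number of code units written; a result
 * smaller than length means src[result] is an illegal byte. */
static size_t ascii_to_utf16(const uint8_t *src, size_t length, uint16_t *dst)
{
    size_t i = 0;
    while (i < length && src[i] < 0x80) {
        dst[i] = src[i];
        ++i;
    }
    return i;
}
```

For the two input bytes 0x41 0xE9, `latin1_to_utf16` writes U+0041 and U+00E9, while `ascii_to_utf16` returns 1 because 0xE9 is not a legal US-ASCII byte. How the illegal byte is then handled (error, substitution, or escape) is a separate decision that this sketch leaves to the caller.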