| --- |
| layout: default |
| title: Conversion |
| nav_order: 4 |
| has_children: true |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Conversion |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Conversion Overview |
| |
| A converter converts text from one character encoding to another. In ICU, |
| conversion is always between Unicode and another encoding, in either |
| direction. A text encoding is a particular mapping from a given character set |
| definition to the actual bits used to represent the data. |
| |
| Unicode provides a single character set that covers the major languages of the |
| world, and a small number of machine-friendly encoding forms and schemes to fit |
| the needs of existing applications and protocols. It is designed for best |
| interoperability with both ASCII and ISO-8859-1 (the most widely used character |
| sets) to make it easier for Unicode to be used in almost all applications and |
| protocols. |
| |
| Hundreds of encodings have been developed over the years, each for a small |
| group of languages or for a special purpose. As a result, interpreting, |
| entering, sorting, displaying, and storing text depends on knowledge of all |
| the different character sets and their encodings. Programs have been written |
| either to handle a single encoding at a time and switch between encodings, |
| or to convert between external and internal encodings. |
| |
| There is no single, authoritative source of precise definitions of many of the |
| encodings and their names. However, |
| [IANA](http://www.iana.org/assignments/character-sets) is the best source for |
| names, and our Character Set repository is a good source of encoding definitions |
| for each platform. |
| |
| Transferring text from one machine to another often causes some loss of |
| information, because platforms can interpret the same bytes differently. For |
| example, Shift-JIS can be interpreted differently on Windows™ compared to |
| UNIX®. Windows maps byte value 0x5C to the backslash symbol, while some UNIX |
| machines map that byte value to the Yen symbol. Another problem arises when a |
| character in the codepage looks like both the Unicode Greek letter Mu and the |
| Unicode micro symbol. Some platforms map this codepage byte sequence to one |
| Unicode character, while other platforms map it to the other. Fallbacks can |
| partially fix this problem by mapping both Unicode characters to the same |
| codepage byte sequence. Even though some character information is lost, the |
| text is still readable. |
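The fallback idea described above can be sketched in plain C. This is only an illustration and not ICU's implementation: the table layout, the `MapKind` and `MapEntry` types, and the `from_unicode` function are hypothetical names invented for this example. The byte 0xB5 is the micro sign in ISO-8859-1; treating the Greek small letter mu as a fallback to that byte is an assumption made for demonstration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping kinds for a from-Unicode conversion table. */
typedef enum { MAP_ROUNDTRIP, MAP_FALLBACK } MapKind;

typedef struct {
    uint32_t code_point; /* Unicode code point */
    uint8_t  byte;       /* codepage byte it maps to */
    MapKind  kind;       /* roundtrip, or one-way fallback */
} MapEntry;

/* Illustrative two-entry table. 0xB5 is the micro sign in ISO-8859-1;
 * the Greek mu fallback is an assumption made for this example. */
static const MapEntry kTable[] = {
    { 0x00B5u, 0xB5u, MAP_ROUNDTRIP }, /* U+00B5 MICRO SIGN */
    { 0x03BCu, 0xB5u, MAP_FALLBACK  }  /* U+03BC GREEK SMALL LETTER MU */
};

/* Returns 1 and stores the codepage byte in *out when code_point can be
 * mapped, 0 otherwise. Fallback entries are used only if allow_fallback
 * is nonzero. */
int from_unicode(uint32_t code_point, int allow_fallback, uint8_t *out)
{
    size_t i;
    for (i = 0; i < sizeof kTable / sizeof kTable[0]; ++i) {
        if (kTable[i].code_point != code_point) {
            continue;
        }
        if (kTable[i].kind == MAP_FALLBACK && !allow_fallback) {
            return 0; /* mapping exists, but only as a fallback */
        }
        *out = kTable[i].byte;
        return 1;
    }
    return 0; /* not in the table at all */
}
```

With fallbacks enabled, both U+00B5 and U+03BC produce the byte 0xB5, so the text stays readable but the distinction between the two characters cannot be recovered when converting back. With fallbacks disabled, U+03BC is reported as unmappable.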
| |
| ICU's converter API has the following main features: |
| |
| 1. Unicode surrogate support |
| |
| 2. Support for all major encodings |
| |
| 3. Consistent text conversion across all computer platforms |
| |
| 4. Text data can be streamed (buffered) through the API |
| |
| 5. Fast text conversion |
| |
| 6. Fallback mappings to the codepage |
| |
| 7. Reverse fallback mappings to Unicode |
| |
| 8. Callbacks for handling and substituting invalid or unmapped byte |
| sequences |
| |
| 9. User-added support for encodings that are not supported out of the box |
| |
| This section describes how to convert text between other encodings and |
| Unicode. |
| |
| ## Recommendations |
| |
| 1. **Use Unicode encodings whenever possible.** Combined with Unicode for |
| internal processing, this makes completely globalized systems possible and |
| avoids the many problems with non-algorithmic conversions. (For a discussion |
| of such problems, see for example ["Character Conversions and Mapping |
| Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt) |
| on <http://icu-project.org/docs/> and the [XML Japanese |
| Profile](http://www.w3.org/TR/japanese-xml/)). |
| |
| 1. Use UTF-8 and UTF-16. |
| |
| 2. Use UTF-16BE, SCSU and BOCU-1 as appropriate. |
| |
| 3. In special environments, other Unicode encodings may be used as well, |
| such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and |
| CESU-8. (For turning Unicode filenames into ASCII-only filename strings, |
| the IMAP-mailbox-name encoding can be used.) |
| |
| 4. Do not exchange text with single/unpaired surrogates. |
| |
| 2. **Use legacy charsets only when absolutely necessary.** For best data |
| fidelity: |
| |
| 1. ISO-8859-1 is relatively unproblematic — if its limited character |
| repertoire is sufficient — because it is converted trivially (1:1) to |
| Unicode, avoiding conversion table problems for its small set of |
| characters. (By contrast, proper conversion from US-ASCII requires a |
| check for illegal byte values 0x80..0xff, which is an unnecessary |
| complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly |
| as ubiquitous for modern systems as US-ASCII was for 7-bit systems.) |
| |
| 2. If you need to communicate with a certain platform, then use the same |
| conversion tables as that platform itself, or at least ones that match |
| them as closely as possible. |
| |
| 3. ICU's conversion table repository contains hundreds of Unicode |
| conversion tables from a number of common vendors and platforms as well |
| as comparisons between these conversion tables: |
| <http://icu-project.org/charts/charset/>. |
| |
| 4. Do not trust codepage documentation that is not machine-readable, such |
| as nice-looking charts. Such documentation is usually incomplete and out |
| of date. |
| |
| 5. ICU's default build includes about 200 conversion tables. See the [ICU |
| Data](../icudata.md) chapter for how to add or remove conversion tables |
| and other data. |
| |
| 6. In ICU, you can (and should) also use APIs that map a charset name |
| together with a standard/platform name. This allows you to get different |
| converters for the same ambiguous charset name (like "Shift-JIS"), |
| depending on the standard or platform specified. See the |
| [convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt) |
| alias table, the [Using Converters](converters.md) chapter and [API |
| references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html). |
| |
| 7. For data exchange (rather than pure display), turn off fallback |
| mappings with `ucnv_setFallback(cnv, FALSE)`. |
| |
| 8. For some text formats, especially XML and HTML, it is possible to set an |
| "escape callback" function that turns unmappable Unicode code points |
| into corresponding escape sequences, preventing data loss. See the API |
| references and the [ucnv sample |
| code](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/ucnv/). |
| |
| 9. **Never modify a conversion table.** Instead, use existing ones that |
| match precisely those in systems with which you communicate. "Modifying" |
| a conversion table in reality just creates a new one, which makes the |
| whole situation even less manageable. |
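The remark above about ISO-8859-1 versus US-ASCII can be made concrete with a small sketch in plain C. It does not use ICU; the function names `latin1_to_utf16` and `ascii_to_utf16` are hypothetical and exist only for this illustration. The first function needs no validation because every ISO-8859-1 byte value equals its Unicode code point, while the second must reject bytes in the range 0x80..0xFF.

```c
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1 to UTF-16: each byte value is its own Unicode code point,
 * so the conversion is a plain widening copy that cannot fail.
 * Returns the number of UTF-16 code units written (always n). */
size_t latin1_to_utf16(const uint8_t *src, size_t n, uint16_t *dst)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
    return n;
}

/* US-ASCII to UTF-16: bytes 0x80..0xFF are illegal and must be detected.
 * Returns 0 on success. On the first illegal byte, stores its index in
 * *bad_index and returns -1; units before that index are already written. */
int ascii_to_utf16(const uint8_t *src, size_t n, uint16_t *dst,
                   size_t *bad_index)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        if (src[i] >= 0x80u) {
            *bad_index = i;
            return -1;
        }
        dst[i] = src[i];
    }
    return 0;
}
```

The extra branch and error path in `ascii_to_utf16` are the "unnecessary complication" mentioned in the text. A real application would normally let an ICU converter handle both charsets rather than hand-roll these loops.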
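The recommendation not to exchange text with single or unpaired surrogates can be enforced with a check before text leaves the system. The following is a minimal sketch in plain C; the function name `utf16_has_unpaired_surrogate` is hypothetical. ICU's `unicode/utf16.h` header provides macros such as `U16_IS_LEAD` and `U16_IS_TRAIL` that could replace the explicit range comparisons used here.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the UTF-16 string s of length n (in code units) contains a
 * lead surrogate not followed by a trail surrogate, or a trail surrogate
 * not preceded by a lead surrogate. Returns 0 if the string is well formed. */
int utf16_has_unpaired_surrogate(const uint16_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint16_t u = s[i];
        if (u >= 0xD800u && u <= 0xDBFFu) {
            /* Lead surrogate: the next unit must be a trail surrogate. */
            if (i + 1 < n && s[i + 1] >= 0xDC00u && s[i + 1] <= 0xDFFFu) {
                i += 2; /* skip the complete pair */
                continue;
            }
            return 1;
        }
        if (u >= 0xDC00u && u <= 0xDFFFu) {
            return 1; /* trail surrogate without a preceding lead */
        }
        ++i;
    }
    return 0;
}
```

A sender could call this on each outgoing buffer and refuse or repair text for which it returns 1, rather than passing ill-formed UTF-16 to another system.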