| --- |
| layout: default |
| title: Converter |
| nav_order: 1 |
| parent: Conversion |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Using Converters |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Overview |
| |
| When designing applications around Unicode characters, it is sometimes required |
| to convert between Unicode encodings or between Unicode and legacy text data. |
| The vast majority of modern operating systems support Unicode to some degree, |
| but sometimes the legacy text data from older systems needs to be converted to |
| and from Unicode. This conversion process can be done with an ICU converter. |
| |
| ## ICU converters |
| |
| ICU provides comprehensive character set conversion services, mapping tables, |
| and implementations for many encodings. Since ICU uses Unicode (UTF-16) |
| internally, all converters convert between UTF-16 (with the endianness according |
| to the current platform) and another encoding. This includes Unicode encodings. |
| In other words, internal text is 16-bit Unicode, while "external text" used as |
| source or target for a conversion is always treated as a byte stream. |
| |
| ICU converters are available for a wide range of encoding schemes. Most of them |
| are based on mapping table data that is handled by a few generic implementations. |
| Some encodings are implemented algorithmically in addition to (or instead of) |
| using mapping tables, especially Unicode encodings. The partly or entirely |
| table-based encoding schemes include: All ICU converters map only single Unicode |
| character code points to and from single codepage character code points. ICU |
| converters **do not** deal directly with combining characters, bidirectional |
| reordering, or Arabic shaping, for example. Such processes, if required, must be |
| handled separately. For example, while in Unicode, the ICU BiDi APIs can be used |
| for bidirectional reordering after a conversion to Unicode or before a |
| conversion from Unicode. |
| |
| ICU converters are not designed to perform any encoding autodetection. This |
| means that the converters do not autodetect "endianness", the 6 Unicode encoding |
| signatures, or the Shift-JIS vs. EUC-JP, etc. There are two exceptions: The |
| UTF-16 and UTF-32 converters work according to Unicode's specification of their |
| Character Encoding Schemes, that is, they read the BOM to figure out the actual |
| "endianness". |
| |
| The ICU mapping tables mostly come from an [IBM® codepage |
| repository](http://www.ibm.com/software/globalization/cdra). For non-IBM |
| codepages, there is typically an equivalent codepage registered with this |
| repository. However, the textual data format (.ucm files) is generic, and data |
| for other codepage mapping tables can also be added. |
| |
| ## Using the Default Codepage |
| |
| ICU has code to determine the default codepage of the system or process. This |
| default codepage can be used to convert `char *` strings to and from Unicode. |
| |
| Depending on system design, setup and APIs, it may not always be possible to |
| find a default codepage that fully works as expected. For example, |
| |
| 1. On Windows there are three encodings in use at the same time. Unicode |
| (UTF-16) is always used inside of Windows, while for `char *` encodings there |
| are two classes, called "ANSI" and "OEM" codepages. ICU will use the ANSI |
| codepage. Note that the OEM codepage is used by default for console window |
| output. |
| |
| 2. On some UNIX-type systems, non-standard names are used for encodings, or |
| non-standard encodings are used altogether. Although ICU supports over 200 |
| encodings in its standard build and many more aliases for them, it will not |
| be able to recognize such non-standard names. |
| |
| 3. Some systems do not have a notion of a system or process codepage, and may |
| not have APIs for that. |
| |
| If you have means of detecting a default codepage name that are more appropriate |
| for your application, then you should set that name with `ucnv_setDefaultName()` |
| as the first ICU function call. This makes sure that the internally cached |
| default converter will be instantiated from your preferred name. |
| |
| Starting in ICU 2.0, when a converter for the default codepage cannot be opened, |
| a fallback default codepage name and converter will be used. On most platforms, |
| this will be US-ASCII. For z/OS (OS/390), ibm-1047,swaplfnl is the default |
| fallback codepage. For AS/400 (iSeries), ibm-37 is the default fallback |
| codepage. This default fallback codepage is used when the operating system is |
| using a non-standard name for a default codepage, or the converter was not |
| packaged with ICU. The feature allows ICU to run in unusual computing |
| environments without completely failing. |
| |
| ## Usage Model |
| |
| A "Converter" refers to the C structure `UConverter`. Converters are cheap to |
| create. Any data that is shared between converters of the same kind (such as the |
| mappings, the name and the properties) is automatically cached and shared in |
| memory. |
| |
| ### Converter Names |
| |
| Codepages with encoding schemes have been given many names by various vendors |
| and platforms over the years. Vendors have different ways to specify which codepage |
| and encoding are being used. IBM uses a CCSID (Coded Character Set IDentifier). |
| Windows uses a CPID (CodePage IDentifier). Macintosh has a TextEncoding. Many |
| Unix vendors use [IANA](http://www.iana.org/assignments/character-sets) |
| character set names. Many of these names are aliases to converters within ICU. |
| |
| In order to help identify which names are recognized by certain platforms, ICU |
| provides several converter alias functions. The complete description of these |
| functions can be found in the [ICU API Reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html). |
| |
| | Function Names | Short Description | |
| | -------------- | ----------------- | |
| | `ucnv_countAvailable`, `ucnv_getAvailableName` | Get a list of available converter names that can be opened. | |
| | `ucnv_openAllNames` | Get a list of all known converter names. | |
| | `ucnv_getName` | Get the name of an open converter. | |
| | `ucnv_countAliases`, `ucnv_getAlias` | Get the list of aliases for the specified converter. | |
| | `ucnv_countStandards`, `ucnv_getStandard` | Get the list of known standards. | |
| | `ucnv_openStandardNames` | Get a filtered list of aliases for a converter that is known by the specified standard. | |
| | `ucnv_getStandardName` | Get the preferred alias name specified by a given standard. | |
| | `ucnv_getCanonicalName` | Get the converter name from the alias that is recognized by the specified standard. | |
| | `ucnv_getDefaultName` | Get the default converter name that is currently used by ICU and the operating system. | |
| | `ucnv_setDefaultName` | Use this function to override the default converter name. | |
| |
| Even though IANA specifies a list of aliases, it usually does not specify the |
| mappings or the actual character set for the aliases. Sometimes vendors will map |
| similar glyph variants to different Unicode code points or sometimes they will |
| assign completely different glyphs for the same codepage code point. Because of |
| these ambiguities, you can sometimes get `U_AMBIGUOUS_ALIAS_WARNING` for the |
| returned `UErrorCode` when more than one converter uses the requested alias. This |
| is only a warning, and the results can still be used. This UErrorCode value is |
| just a reminder that you may not get what you expected. The above functions can |
| help you to determine which converter you actually wanted. |
| |
| EBCDIC-based converters have the option to swap the newline and linefeed |
| character mappings. This can be useful when transferring EBCDIC documents |
| between z/OS (MVS, OS/390 and the rest of the zSeries family) and another EBCDIC |
| machine like OS/400 on iSeries. The ",swaplfnl" or `UCNV_SWAP_LFNL_OPTION_STRING` |
| from ucnv.h can be appended to a converter alias in order to achieve this |
| behavior. You can view other available options in ucnv.h. |
| |
| You can always skip many of these aliasing and mapping problems by just using |
| Unicode. |
| |
| ### Creating a Converter |
| |
| There are four ways to create a converter: |
| |
| 1. **By name**: Converters can be created using different types of names. No |
| distinction is made when the converter is created, as to which name is being |
| employed. There are many types of aliases possible. Among these are |
| [IANA](http://www.iana.org/assignments/character-sets) ("shift_jis", |
| "koi8-r", or "iso-8859-3"), host specific names ("cp1252" which is the name |
| for a Microsoft® Windows™ or a similar IBM® codepage). Finally, ICU's own |
| internal canonical names for a converter can be used. These include "UTF-8" |
| or "ISO-8859-1" for built-in conversion types, and names such as |
| "ibm-949_P110-2000" (Shift-JIS with '\\' <-> '¥' mapping) or |
| "ibm-949_P11A-2000" (Shift-JIS with '\\' <-> '\\' mapping) for data-file |
| based conversions. |
| |
| ```c |
| UConverter *conv = ucnv_open("shift_jis", &myError); |
| ``` |
| |
|     As a convenience, converter names can be passed in as Unicode (for example, |
|     if a user passed in the string from a Unicode-based user interface). |
|     However, the actual names are restricted to an invariant ASCII/EBCDIC |
|     subset. |
| |
| ```c |
|     UChar *name = ...; |
|     UConverter *conv = ucnv_openU(name, &myError); |
| ``` |
| |
| Converter names are case-insensitive. In addition, beginning with ICU 3.6, |
| leading zeroes are ignored in sequences of digits (if further digits |
| follow), and all non-alphanumeric characters are ignored. Thus the strings |
| "UTF-8", "utf_8", "u\*T@f08" and "Utf 8" are equivalent. (Before ICU 3.6, |
| leading zeroes were not ignored, and only spaces, dashes and underscores |
| were ignored.) The `ucnv_compareNames()` function provides such string |
| comparisons. |
| |
| Unlike the names of resources or other types of ICU data, converter names |
| can **not** be qualified with a path that indicates the directory or common |
| data file containing the corresponding converter data. The requested |
| converter's data must be present either in the main ICU data library or as a |
| separate file located in the ICU data directory. However, you can always |
| create a package of converters with pkgdata and open a converter from the |
|     package with `ucnv_openPackage()`: |
| |
| ```c |
| UConverter *conv = ucnv_openPackage("./myPackage.dat", "customConverter", &myError); |
| ``` |
| |
| 2. **By number**: The design of the ICU is to accommodate codepages provided by |
| different vendors. For example, the IBM CDRA (Character Data Representation |
| Architecture which is an IBM architecture that defines a set of identifiers) |
| has an ID type called the CCSID (Coded Character Set Identifier). The ICU |
| API for opening a codepage by number must be given a vendor along with the |
| number. Currently, only IBM (`UCNV_IBM`) is supported. For example, the US |
| EBCDIC codepage (IBM #37) can be opened with the following code: |
| |
| ```c |
| ucnv_openCCSID(37, UCNV_IBM, &myErr); |
| ``` |
| |
| 3. **By iteration**: An application might not know ahead of time which codepage |
| to use, and thus might need to query ICU to determine the entire list of |
| installed converters. The ICU returns a list of its canonical (internal) |
|     names. From each name, the standard IANA name can be determined, and also a |
| list of aliases which point to that name can be determined. For example, ICU |
| might return among the canonical names "ibm-367". That name itself may or |
| may not provide the application or its users with the information needed. |
| (367 is actually the decimal form of a number that is calculated by |
| appending certain hex digits together.) However, the IANA name can be |
| requested from this canonical name, which should return something like |
| "us-ascii". The alias list for ibm-367 can be iterated over as well, which |
| returns additional names like "ascii", "646", "ansi_x3.4-1968" etc. If this |
| is not sufficient information, once a converter is opened, it can be queried |
| for its type, min and max char size, etc. This information is not available |
| without actually opening the converter (a fairly lightweight process.) |
| |
| ```c |
|     /* Returns the number of available converter names */ |
|     int32_t count = ucnv_countAvailable(); |
|     /* Get the canonical name of the available converter at index 36 */ |
|     const char *convName1 = ucnv_getAvailableName(36); |
|     /* Get the alias at index 3 for a given codepage */ |
|     const char *asciiAlias = ucnv_getAlias("ibm-367", 3, &myError); |
|     /* Get the preferred IANA name of the converter */ |
|     const char *ascii = ucnv_getStandardName("ibm-367", "IANA", &myError); |
|     /* Enumerate the IANA aliases to reach a non-preferred one */ |
|     UEnumeration *asciiEnum = |
|         ucnv_openStandardNames("ibm-367", "IANA", &myError); |
|     uenum_next(asciiEnum, NULL, &myError);  /* skip preferred IANA alias */ |
|     /* Get one of the non-preferred IANA aliases */ |
|     const char *ascii2 = uenum_next(asciiEnum, NULL, &myError); |
|     uenum_close(asciiEnum); |
| ``` |
| |
| 4. **By using the default converter**: The default converter can be opened by |
| passing a NULL as the name of the converter. |
| |
| ```c |
| ucnv_open(NULL, &myErr); |
| ``` |
| |
| > :point_right: **Note**: ICU chooses this converter based on the best information available to it. |
| > The purpose of this converter is to interface with the OS using a codepage (i.e. `char *`). |
| > Do not use it as a way of determining the best overall converter to use. |
| > Usually any Unicode encoding form is the best way to store and send text data, |
| > so that important data does not get lost in the conversion. |
| > Also, if the OS supports Unicode-based APIs (such as Win32), |
| > it is better to use only those Unicode APIs. |
| > As an example, the new Windows 2000 locales (such as Hindi) do not |
| > define the default codepage to something that supports Hindi. |
| > The default converter is used in expressions such as: `UnicodeString text("abc");` |
| > to convert 'abc', and in the `u_uastrcpy()` C functions. |
| > Code operating at the [OS level](../design.md) MAY choose to |
| > change the default converter with `ucnv_setDefaultName()`. |
| > However, be aware that this change has inconsistent results if it is done after |
| > ICU components are initialized. |
| |
| ### Closing a Converter |
| |
| Closing a converter frees memory occupied by that instance of the converter. |
| However it does not release the larger shared data tables the converter might |
| use. OS-level code may call `ucnv_flushCache()` to explicitly free memory occupied |
| by [unused tables](../design.md). |
| |
| ```c |
| ucnv_close(conv); |
| ``` |
| |
| ### Converter Life Cycle |
| |
| Note that a Converter is created with a certain type (for instance, ISO-8859-3) |
| which does not change over the life of that [object](../design.md). Converters |
| should be allocated one per thread. They are cheap to create, as the shared data |
| doesn't need to be reallocated. |
| |
| This is the typical life cycle of a converter, as shown step-by-step: |
| |
| 1. First, open up the converter with a specified name (or alias name). |
| ```c |
| UConverter *conv = ucnv_open("shift_jis", &status); |
| ``` |
| |
| 2.  Convert the text. Here, `target` is the `char` buffer to write into, |
|     `targetSize` is the capacity of that buffer in bytes, and `source` is the |
|     NUL-terminated `UChar` string being converted. |
| ```c |
| int32_t len = ucnv_fromUChars(conv, target, targetSize, source, u_strlen(source), &status); |
| ``` |
| |
| 3. Clean up the converter. |
| ```c |
| ucnv_close(conv); |
| ``` |
| |
| ### Sharing Converters Between Threads |
| |
| A converter cannot be shared between threads at the same time. However, if it is |
| reset it can be used for unrelated chunks of data. For example, use the same |
| converter for converting data from Unicode to ISO-8859-3, and then reset it. Use |
| the same converter for converting data from ISO-8859-3 back into Unicode. |
| |
| ### Converting Large Quantities of Data |
| |
| If it is necessary to convert a large quantity of data in smaller buffers, use |
| the same converter to convert each buffer. This will make sure any state is |
| preserved from one chunk to the next. Doing this conversion is known as |
| streaming or buffering, and is described in the |
| [Buffered or Streamed](#3-buffered-or-streamed) section later in this chapter. |
| |
| ### Cloning a Converter |
| |
| Cloning a converter returns a clone of the converter object along with any |
| internal state that the converter might be storing. Cloning routines must be |
| used with extreme care when using converters for stateful or multibyte |
| encodings. If the converter object is carrying an internal state, and the |
| newly-created clone is used to convert a new chunk of text, the converter |
| produces incorrect results. Also note that the caller owns the cloned object and |
| has to call `ucnv_close()` to dispose of the object. Calling `ucnv_reset()` before |
| cloning will reset the converter to its original state. |
| |
| ```c |
| UConverter *newCnv = ucnv_safeClone(oldCnv, NULL, &bufferSize, &err); |
| ``` |
| |
| ## Converter Behavior |
| |
| ### Conversion |
| |
| 1. The converters always consume the source buffer as far as possible, and |
| advance the source pointer. |
| |
| 2. The converters write to the target all converted output as far as possible, |
| and then write any remaining output to the internal services buffer. When |
| the conversion routines are called again, the internal buffer is flushed out |
| and written to the target buffer before proceeding with any further |
| conversion. |
| |
| 3. In conversions to Unicode from Multi-byte encodings or conversions from |
| Unicode involving surrogates, if (a) only a partial byte sequence is |
| retrieved from the source buffer, (b) the "flush" parameter is set to "TRUE" |
| and (c) the end of source is reached, then the callback is called with |
| `U_TRUNCATED_CHAR_FOUND`. |
| |
| ### Reset |
| |
| Converters can be reset explicitly or implicitly. Explicit reset is done by |
| calling: |
| |
| 1.  `ucnv_reset()`: Resets the converter to its initial state in both directions. |
| |
| 2.  `ucnv_resetToUnicode()`: Resets the converter to its initial state in the |
|     to-Unicode direction. |
| |
| 3.  `ucnv_resetFromUnicode()`: Resets the converter to its initial state in the |
|     from-Unicode direction. |
| |
| The converters are reset implicitly when the conversion functions are called |
| with the "flush" parameter set to "TRUE" and the source is consumed. |
| |
| ### Error |
| |
| #### Conversion from Unicode |
| |
| Not all characters can be converted from Unicode to other codepages. In most |
| cases, Unicode is a superset of the characters supported by any given codepage. |
| |
| The default behavior of ICU in this case is to substitute the illegal or |
| unmappable sequence with the appropriate substitution sequence for that |
| codepage. For example, ISO-8859-1, along with most ASCII-based codepages, has |
| the character 0x1A (Control-Z) as the substitution sequence. When converting |
| from Unicode to ISO-8859-1, any characters which cannot be converted would be |
| replaced by 0x1A's. |
| |
| SubChar1 is sometimes used as substitution character in MBCS conversions. For |
| more information on SubChar1 please see the [Conversion Data](data.md) chapter. |
| |
| In stateful converters like ISO-2022-JP, if a substitution character has to be |
| written to the target, then an escape/shift sequence to change the state to |
| single byte mode followed by a substitution character is written to the target. |
| |
| The substitution character can be changed by calling the `ucnv_setSubstChars()` |
| function with the desired codepage byte sequence. However, this has some |
| limitations: It only allows setting a single character (although the character |
| can consist of multiple bytes), and it may not work properly for some stateful |
| converters (like HZ or ISO 2022 variants) when setting a multi-byte substitution |
| character. (It will work for EBCDIC_STATEFUL ones.) Moreover, for setting a |
| particular character, the caller needs to know the correct byte sequence for |
| that character in the converter's codepage. (For example, a space (U+0020) is |
| encoded as 0x20 in ASCII-based codepages, 0x40 in EBCDIC-based ones, 0x00 0x20 |
| or 0x20 0x00 in UTF-16 depending on the stream's endianness, etc.) |
| |
| The `ucnv_setSubstString()` function (new in ICU 3.6) lifts these limitations. It |
| takes a Unicode string and verifies that it can be converted to the codepage |
| without error and that it is not too long (32 bytes as of ICU 3.6). The string |
| can contain zero, one or more characters. An empty string has the effect of |
| using the skip callback. See the Error Callbacks section below. Stateful converters are |
| fully supported. The same Unicode string will give equivalent results with all |
| converters that support its conversion. |
| |
| Internally, `ucnv_setSubstString()` stores the byte sequence from the test |
| conversion if the converter is stateless, or the Unicode string itself if the |
| converter is stateful. If the Unicode string is stored, then it is converted on |
| the fly during substitution, handling all state transitions. |
| |
| The function `ucnv_getSubstChars()` can be used to retrieve the substitution byte |
| sequence if it is the default one, set by `ucnv_setSubstChars()`, or if |
| `ucnv_setSubstString()` stored the byte sequence for a stateless converter. The |
| Unicode string set for a stateful converter cannot be retrieved. |
| |
| #### Conversion to Unicode |
| |
| In conversion to Unicode, errors are normally due to ill-formed byte sequences: |
| Unused byte values, or lead bytes not followed by trail bytes according to the |
| encoding scheme. Well-formed but unmappable sequences are unusual but possible. |
| |
| The ICU default behavior is to emit a `U+FFFD REPLACEMENT CHARACTER` per |
| offending sequence. |
| |
| If the conversion table .ucm file contains a `<subchar1>` entry (such as in the |
| ibm-943 table), a U+001A C0 control ("SUB") is emitted for single-byte |
| illegal/unmappable input rather than `U+FFFD REPLACEMENT CHARACTER`. For details |
| on this behavior look for "001A" in the [Conversion Data](data.md) chapter. |
| |
| * This behavior originates from mainframes with dedicated single-byte-to-single-byte |
| and double-to-double conversions. |
| * Emitting U+001A for single-byte errors can be avoided by (a) removing the |
| `<subchar1>` mapping or (b) using a similar conversion table that does not |
| have this mapping (e.g., windows-932 instead of ibm-943) or (c) writing a |
| custom callback function. |
| |
| ### Error Codes |
| |
| Here are some of the `UErrorCode`s which have significant meaning for conversion: |
| |
| #### U_INDEX_OUTOFBOUNDS_ERROR |
| |
| In `ucnv_getNextUChar()`: all source data |
| has been consumed without producing a Unicode character. |
| |
| #### U_INVALID_CHAR_FOUND |
| |
| No mapping was found from the source to the target encoding. For example, U+0398 |
| (Capital Theta) has no mapping into ISO-8859-1, and so U_INVALID_CHAR_FOUND |
| will result. |
| |
| #### U_TRUNCATED_CHAR_FOUND |
| |
| All of the source data was read, and a |
| character sequence was incomplete. For example, only half of a double-byte |
| sequence may have been encountered. When converting FROM Unicode, this error |
| would occur when a conversion ends with a high (lead) surrogate such as U+D800 |
| at the end of the source, with no following low (trail) surrogate. |
| |
| #### U_ILLEGAL_CHAR_FOUND |
| |
| A character sequence was found in the source which is disallowed in the source |
| encoding scheme. For example, many MBCS encodings have only certain byte |
| sequences which are allowed as lead bytes. When converting from Unicode, if a |
| high (lead) surrogate is NOT followed immediately by a low (trail) surrogate, |
| or a low surrogate appears without a preceding high surrogate, an illegal |
| sequence results. |
| Note: Most, but not all, converters forbid surrogate code points or unpaired |
| surrogate code units (lead surrogate without trail, or trail without lead). |
| Some converters permit surrogate code points/unpaired surrogates because their |
| charset specification permits it, for example LMBCS, SCSU and BOCU-1. |
| |
| #### U_INVALID_TABLE_FORMAT |
| |
| An error occurred trying to read the backing data |
| for the converter. The data could be corrupt, or the wrong |
| version. |
| |
| #### U_BUFFER_OVERFLOW_ERROR |
| |
| More output (target) characters were produced |
| than fit in the target buffer. If this happens in `ucnv_toUnicode()` or |
| `ucnv_fromUnicode()`, process the target buffer and call the function again to |
| retrieve the overflowed characters. |
| |
| ### Error Callbacks |
| |
| When a conversion failure occurs, an "error callback function" is called at the |
| point where the failure occurred. The function can deal with the |
| failed characters as it sees fit. Possible options at the callback's disposal |
| include ignoring the bad sequence, converting it to a different sequence, and |
| returning an error to the caller. The callback can also consume any data past |
| where the error occurred, whether or not that data would have caused an error. |
| Only one callback is installed at a time, per direction (to or from Unicode). |
| |
| A number of canned functions are provided by ICU, and an application can write |
| new ones. The "callbacks" are either From Unicode (to codepage), or To Unicode |
| (from codepage). Here is a list of the canned callbacks in ICU: |
| |
| 1. UCNV_**FROM_U**_CALLBACK_SUBSTITUTE: This callback is installed by default. |
| It will write the codepage's substitute sequence or a user-set substitute |
| sequence, or convert a user-set substitute UnicodeString to the codepage. |
| See "Error / Conversion from Unicode" above. |
| |
| 2. UCNV_**TO_U**_CALLBACK_SUBSTITUTE: This callback is installed by default. It |
| will write U+FFFD or sometimes U+001A. See "Error / Conversion to Unicode" |
| above. |
| |
| 3. UCNV_FROM_U_CALLBACK_SKIP, UCNV_TO_U_CALLBACK_SKIP: Simply ignores any |
| invalid characters in the input, no error is returned. |
| |
| 4. UCNV_FROM_U_CALLBACK_STOP, UCNV_TO_U_CALLBACK_STOP: Stop at the error. |
| Return the error to the caller. (When using the 'BUFFER' mode of conversion, |
| the source and target pointers returned can be examined to determine where |
| the error occurred. `ucnv_getInvalidUChars()` and `ucnv_getInvalidChars()` |
| return the actual text which failed). |
| |
| 5. UCNV_FROM_U_CALLBACK_ESCAPE, UCNV_TO_U_CALLBACK_ESCAPE: This callback is |
| especially useful for debugging. Missing codepage characters are replaced by |
| strings such as '%U094D' with the Unicode value, and missing Unicode chars |
| are replaced with text of the form '%X0A' where the codepage had the |
| unconvertible byte hex 0A. |
| |
| When a callback is set, a "context" pointer is also provided. How this |
| pointer is created depends on the specific callback. There is usually a |
| `createContext()` function for that specific callback, where the caller can |
| set certain options for the callback. Consult the documentation for the |
| specific callback you are using. For ICU's canned callbacks, this pointer |
| may be set to NULL. The functions for setting a different callback also |
| return the old callback, and the old context pointer. These may be stored so |
| that the old callback is re-installed when an operation is finished. |
| |
| Additionally the following options can be passed as the context parameter to |
| UCNV_FROM_U_CALLBACK_ESCAPE callback function to produce different outputs. |
| |
| | Context option | Example output | |
| | ------------------- | ------- | |
| | UCNV_ESCAPE_ICU | %U12345 | |
| | UCNV_ESCAPE_JAVA | \\u1234 | |
| | UCNV_ESCAPE_C | \\udbc9\\udd36 for Plane 1 and \\u1234 for Plane 0 codepoints | |
| | UCNV_ESCAPE_XML_DEC | `&#4460;` number expressed in Decimal | |
| | UCNV_ESCAPE_XML_HEX | `&#x1234;` number expressed in Hexadecimal | |
| |
| Here are some examples of how to use callbacks. |
| |
| ```c |
| UConverter *u; |
| const void *oldContext, *newContext; |
| UConverterFromUCallback oldAction, newAction; |
| u = ucnv_open("shift_jis", &myError); |
| |
| ... /* do some conversion with u from unicode.. */ |
| |
| ucnv_setFromUCallBack( |
| u, MY_FROMU_CALLBACK, newContext, &oldAction, &oldContext, &myError); |
| |
| ... /* do some other conversion from unicode */ |
| |
| /* Now, set the callback back */ |
| ucnv_setFromUCallBack( |
| u, oldAction, oldContext, &newAction, &newContext, &myError); |
| |
| ``` |
| |
| ### Custom Callbacks |
| |
| Writing a callback is somewhat involved, and will be covered more completely in |
| a future version of this document. One might look at the source to the provided |
| callbacks as a starting point, and address any further questions to the mailing |
| list. |
| |
| Basically, a callback, unlike other ICU functions which expect to be called with |
| `U_ZERO_ERROR` as the input, is called in an exceptional error condition. The |
| callback is a kind of 'last ditch effort' to rectify the error which occurred, |
| before it is returned back to the caller. This is why the implementation of STOP |
| is very simple: |
| |
| ```c |
| void UCNV_FROM_U_CALLBACK_STOP(...) { } |
| ``` |
| |
| The error code such as `U_INVALID_CHAR_FOUND` is returned to the user. If the |
| callback determines that no error should be returned to the user, then the |
| callback must set the error code to `U_ZERO_ERROR`. Note that this is a departure |
| from most ICU functions, which are supposed to check the error code and return |
| immediately if it is set. |
| |
| > :point_right: **Note**: See the functions `ucnv_cb_write...()` for |
| > functions which a callback may use to perform its task. |
| |
| #### Ignore Default_Ignorable_Code_Point |
| |
| Unicode has a number of characters that are not by themselves meaningful but |
| assist with line breaking (e.g., U+00AD Soft Hyphen & U+200B Zero Width Space), |
| bi-directional text layout (U+200E Left-To-Right Mark), collation and other |
| algorithms (U+034F Combining Grapheme Joiner), or indicate a preference for a |
| particular glyph variant (U+FE0F Variation Selector 16). These characters are |
| "invisible" by default, that is, they should normally not be shown with a glyph |
| of their own, except in special circumstances. Examples include showing a hyphen |
| for when a Soft Hyphen was used for a line break, or modifying the glyph of a |
| character preceding a Variation Selector. |
| |
| Unicode has a character property to identify such characters, as well as |
| currently-unassigned code points that are intended to be used for similar |
| purposes: Default_Ignorable_Code_Point, or "DI" for short: |
| http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:] |
| |
| Most charsets contain few or none of these characters. |
| |
| **ICU 54 and above by default skip default-ignorable code points if they are |
| unmappable**. (Ticket #[10551](https://unicode-org.atlassian.net/browse/ICU-10551)) |
| |
| **Older versions of ICU** replaced unmappable default-ignorable code points like |
| any other unmappable code points, by a question mark or whatever substitution |
| character is defined for the charset. |
| |
| For best results, a custom from-Unicode callback can be used to ignore |
| Default_Ignorable_Code_Point characters that cannot be converted, so that they |
| are removed from the charset output rather than replaced by a visible character. |
| |
| This is a code snippet for use in a custom from-Unicode callback: |
| |
| ```c |
| #include "unicode/uchar.h" |
| // ... |
| // (inside a from-Unicode callback) |
| switch(reason) { |
| case UCNV_UNASSIGNED: |
|     if(u_hasBinaryProperty(codePoint, UCHAR_DEFAULT_IGNORABLE_CODE_POINT)) { |
|         // Ignore/drop default ignorable code points that cannot be converted, |
|         // rather than treating them like errors/writing a substitution character etc. |
|         // For example, U+200B Zero Width Space, |
|         // U+200E Left-To-Right Mark, U+FE0F Variation Selector 16. |
|         *pErrorCode = U_ZERO_ERROR; |
|         return; |
|     } else { |
|         // ... |
|     } |
|     // ... |
| } |
| ``` |
| |
| ## Modes of Conversion |
| |
| When a converter is instantiated, it can be used to convert both in the Unicode |
| to Codepage direction, and also in the Codepage to Unicode direction. There are |
| three ways to use the converters, a preflighting technique for sizing the target |
| buffer, and convenience functions for converting a whole string in one call. |
| |
| 1. **Single-String**: Simplest type of conversion to or from Unicode. The data |
|    is entirely contained within a single string. |
| |
| 2. **Character**: Convert from the codepage to Unicode, one code point at a |
|    time. |
| |
| 3. **Buffered or Streamed**: Convert data which may not fit entirely within a |
|    single buffer. Usually the most efficient and flexible. |
| |
| 4. **Pre-flighting**: Ask the conversion API for the required size of the target |
|    buffer before converting. |
| |
| 5. **Convenience**: Convert a single buffer in one function call, for example |
|    from one codepage to another through Unicode, in some cases without |
|    requiring the instantiation of a converter. |
| |
| ### 1. Single-String |
| |
| Data must be contained entirely within a single string or buffer. |
| |
| ```c |
| UErrorCode status = U_ZERO_ERROR; |
| UConverter *conv; |
| int32_t len; |
| |
| /* Convert from Unicode to Shift JIS: source is UChar[], target is char[] */ |
| conv = ucnv_open("shift_jis", &status); |
| len = ucnv_fromUChars(conv, target, targetCapacity, source, sourceLen, &status); |
| ucnv_close(conv); |
| |
| /* Convert from ISO-8859-3 to Unicode: source is char[], target is UChar[] */ |
| conv = ucnv_open("iso-8859-3", &status); |
| len = ucnv_toUChars(conv, target, targetCapacity, source, sourceLen, &status); |
| ucnv_close(conv); |
| ``` |
| |
| ### 2. Character |
| |
| In this mode, the input data is in the specified codepage. Each function call |
| converts only the next Unicode code point. This might be the most efficient way |
| to scan for a certain character, or to process a single character at a time. |
| Because converters are stateful, this works even for multibyte charsets, and |
| for stateful ones such as iso-2022-jp. |
| |
| ```c |
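| UErrorCode status = U_ZERO_ERROR; |
| UConverter *conv = ucnv_open("Big-5", &status); |
| UChar32 target; |
| while(U_SUCCESS(status) && source < sourceLimit) { |
|     target = ucnv_getNextUChar(conv, &source, sourceLimit, &status); |
|     if(U_FAILURE(status)) { |
|         break; /* handle the conversion error */ |
|     } |
|     processChar(target); |
| } |
| ucnv_close(conv); |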
| ``` |
| |
| ### 3. Buffered or Streamed |
| |
| This is used in situations where a large document may be read from disk and |
| processed. Also, many codepages take multiple bytes to encode a character, or |
| have state. These factors make it impossible to convert arbitrary chunks of data |
| without maintaining state across chunks. Even conversion from Unicode may |
| encounter a leading surrogate at the end of one buffer, which needs to be paired |
| with the trailing surrogate in the next buffer. |
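
To illustrate the surrogate case, the following sketch checks whether a chunk of UTF-16 text ends in a leading surrogate. `ends_with_lead_surrogate()` is a hypothetical helper written for this illustration, not an ICU function; ICU converters keep this state internally between calls, and ICU's `U16_IS_LEAD()` macro performs the same range test.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): returns 1 if the chunk ends with a
   leading (high) surrogate, U+D800..U+DBFF. Its trailing surrogate can only
   arrive with the next chunk, so the two chunks cannot be converted
   independently of each other. */
static int ends_with_lead_surrogate(const uint16_t *chunk, size_t length) {
    if (length == 0) {
        return 0;
    }
    uint16_t last = chunk[length - 1];
    return last >= 0xD800 && last <= 0xDBFF;
}
```

For example, U+1F600 is encoded as the pair 0xD83D 0xDE00; a chunk that ends right after 0xD83D leaves the pair split across two buffers.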
| |
| A basic API principle of the ICU to/from Unicode functions is that they will |
| ALWAYS attempt to consume all of the input (source) data, unless the output |
| buffer is full or some other error occurs. In other words, there is no need to |
| ever test whether all of the source data has been consumed. |
| |
| The basic loop that is used with the ICU buffer conversion routines is the same |
| in the to and from Unicode directions. In the following pseudocode, either |
| 'source' (for fromUnicode) or 'target' (for toUnicode) is UTF-16 UChars. |
| |
| ```c |
| UErrorCode err = U_ZERO_ERROR; |
| |
| while (... /*input data available*/ ) { |
|     ... /* read input data into buffer */ |
| |
|     source = ... /* beginning of read data */; |
|     sourceLimit = source + readLength; // end + 1 |
| |
|     /* flush must be TRUE only for the last chunk of input */ |
|     UBool flush = (no further input data available); // (e.g. feof()) |
| |
|     /* loop until all source has been processed */ |
|     do { |
|         /* set up target pointers */ |
|         target = ... /* beginning of output buffer */; |
|         targetLimit = target + sizeOfOutput; |
| |
|         err = U_ZERO_ERROR; /* so that the to/from does not fail */ |
| |
|         ucnv_to/fromUnicode(converter, &target, targetLimit, |
|                             &source, sourceLimit, NULL, flush, &err); |
| |
|         ... /* write (target-beginningOfOutputBuffer) items |
|                starting at beginning of output buffer */ |
|     } while (err == U_BUFFER_OVERFLOW_ERROR); |
|     if(U_FAILURE(err)) { |
|         ... /* process error */ |
|         break; /* out of the 'while' loop that reads source data */ |
|     } |
| } /* loop to read input data */ |
| if(U_FAILURE(err)) { |
|     ... /* process error further */ |
| } |
| ``` |
| |
| The above code optimizes for processing entire chunks of input data. An |
| efficient size for the output buffer, in bytes, can be calculated as follows: |
| |
| ```c |
| /* toUnicode: the output buffer holds UChars */ |
| ucnv_getMinCharSize(converter) * inputBufferSize * sizeof(UChar) |
| /* fromUnicode: the output buffer holds codepage bytes */ |
| ucnv_getMaxCharSize(converter) * inputBufferSize |
| ``` |
| |
| There are two loops used, an outer and an inner. The outer loop fetches input |
| data to keep the source buffer full, and the inner loop 'writes' out data to |
| keep the output buffer empty. |
| |
| Note that while this efficiently handles data on the input side, there are some |
| cases where the size of the output buffer is fixed. For instance, in network |
| applications it is sometimes desirable to fill every output packet completely |
| (not including the last packet in the sequence). The above loop does not ensure |
| that every output buffer is completely full. For example, if a 4 UChar input |
| buffer was used, and a 3 byte output buffer with `fromUnicode()`, the loop would |
| typically write 3 bytes, then 1, then 3, and so on. If, instead of efficient use |
| of the input data, the goal is filling output buffers, a slightly different loop |
| can be used. |
| |
| In such a scenario, the inner write does not occur unless a buffer overflow |
| occurs OR 'flush' is true. So, the 'write' and resetting of the target and |
| targetLimit pointers would only happen |
| `if (err == U_BUFFER_OVERFLOW_ERROR || flush == TRUE)` |
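
A minimal sketch of that output-filling variant follows. So that it runs without ICU, it uses `toy_convert()`, a hypothetical stand-in that only mimics the pointer-advancing contract of `ucnv_fromUnicode()` by copying one byte per 16-bit unit; the status codes and `fill_packets()` are likewise assumptions made for this illustration. In real code, the ICU function and `U_BUFFER_OVERFLOW_ERROR` take their place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_OK = 0, TOY_BUFFER_OVERFLOW = 1 }; /* hypothetical status codes */

/* Hypothetical stand-in for ucnv_fromUnicode(): writes one byte per 16-bit
   unit, advances *target and *source, and reports overflow when the target
   is full while source data remains. */
static int toy_convert(char **target, const char *targetLimit,
                       const uint16_t **source, const uint16_t *sourceLimit) {
    while (*source < sourceLimit) {
        if (*target == targetLimit) {
            return TOY_BUFFER_OVERFLOW;
        }
        *(*target)++ = (char)*(*source)++;
    }
    return TOY_OK;
}

/* Reads `input` in chunks of `chunk` units and emits packets of `packetSize`
   bytes. The size of each emitted packet is recorded in sizes[]; the return
   value is the number of packets. */
static size_t fill_packets(const uint16_t *input, size_t inputLen, size_t chunk,
                           size_t packetSize, size_t *sizes, size_t maxPackets) {
    char packet[64];
    char *target = packet;                 /* persists across input chunks */
    const char *targetLimit = packet + packetSize;
    size_t count = 0, pos = 0;

    assert(chunk > 0 && packetSize > 0 && packetSize <= sizeof packet);
    while (pos < inputLen) {
        size_t n = (inputLen - pos < chunk) ? inputLen - pos : chunk;
        const uint16_t *source = input + pos;
        const uint16_t *sourceLimit = source + n;
        pos += n;
        int flush = (pos == inputLen);     /* true only for the last chunk */
        int err;
        do {
            err = toy_convert(&target, targetLimit, &source, sourceLimit);
            /* Write only when the packet is full or the input has ended. */
            if (err == TOY_BUFFER_OVERFLOW || flush) {
                if (target > packet && count < maxPackets) {
                    sizes[count++] = (size_t)(target - packet); /* "send" it */
                }
                target = packet;           /* start the next packet */
            }
        } while (err == TOY_BUFFER_OVERFLOW);
    }
    return count;
}
```

With a 10-unit input read in 4-unit chunks and 3-byte packets, this sketch emits packets of 3, 3, 3, and 1 bytes, so only the final packet is partially filled.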
| |
| The flush parameter should be set to FALSE on every conversion call except the |
| last one for the input. This is because the conversion is stateful: a chunk may |
| end in the middle of a character, which the converter completes with the next |
| chunk. On the last conversion call, the flush parameter should be set to TRUE. |
| More details are in the API reference in |
| [ucnv.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html). |
| |
| ### 4. Pre-flighting |
| |
| Preflighting is the process of asking the conversion API for the size of target |
| buffer required. (For a more general discussion, see the Preflighting section |
| in the [Strings](../strings/index.md) chapter.) |
| |
| This is accomplished by calling `ucnv_fromUChars` or `ucnv_toUChars` with a NULL |
| target and a target capacity of 0. The function returns the required length and |
| sets the error code to `U_BUFFER_OVERFLOW_ERROR`. |
| |
| ```c |
| UChar *uchar2; |
| /* sizeof() includes the terminating NUL, so it is converted as well */ |
| const char input_char_buffer[] = "This is some text"; |
| int32_t targetsize; |
| UErrorCode err = U_ZERO_ERROR; |
| |
| /* Preflight: a NULL target with capacity 0 returns the required length */ |
| targetsize = ucnv_toUChars(myConverter, NULL, 0, |
|     input_char_buffer, (int32_t)sizeof(input_char_buffer), &err); |
| |
| if(err==U_BUFFER_OVERFLOW_ERROR) { |
|     err=U_ZERO_ERROR; |
|     uchar2=(UChar*)malloc(targetsize * sizeof(UChar)); |
|     targetsize = ucnv_toUChars(myConverter, uchar2, targetsize, |
|         input_char_buffer, (int32_t)sizeof(input_char_buffer), &err); |
|     if(U_FAILURE(err)) { |
|         printf("ucnv_toUChars() FAILED %s\n", u_errorName(err)); |
|     } |
|     else { |
|         printf("ucnv_toUChars() o.k.\n"); |
|     } |
|     free(uchar2); |
| } |
| ``` |
| |
| > :point_right: **Note**: *This is inefficient since the conversion is performed |
| > **twice**, once for finding the size of target and once for writing to the target*. |
| |
| ### 5. Convenience |
| |
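| ICU provides some convenience functions for conversions. The C functions |
| `ucnv_toUChars()` and `ucnv_fromUChars()` convert a whole string in one call; in |
| C++, the `UnicodeString` constructor and `extract()` take a codepage name, so no |
| converter needs to be opened explicitly: |
| |
| ```c |
| /* C: convert a whole string with an already opened converter */ |
| ucnv_toUChars(myConverter, target_uchars, targetsize, |
|     input_char_buffer, sizeof(input_char_buffer), &err); |
| ucnv_fromUChars(cnv, cTarget, (cTargetLimit-cTarget), |
|     uSource, (uSourceLimit-uSource), &errorCode); |
| |
| // C++: UnicodeString converts by codepage name |
| char target[100]; |
| UnicodeString str("ABCDEF", "iso-8859-1"); |
| int32_t targetsize = str.extract(0, str.length(), target, sizeof(target), "SJIS"); |
| target[targetsize] = 0; /* NUL termination */ |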
| ``` |
| |
| ## Conversion Examples |
| |
| See the [ICU Conversion Examples](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ucnv/convsamp.cpp) for more information. |