| --- | 
 | layout: default | 
 | title: UTF-8 | 
 | nav_order: 1 | 
 | parent: Chars and Strings | 
 | --- | 
 | <!-- | 
 | © 2020 and later: Unicode, Inc. and others. | 
 | License & terms of use: http://www.unicode.org/copyright.html | 
 | --> | 
 |  | 
 | # UTF-8 | 
 |  | 
 | *Note: This page is only relevant for C/C++. In Java, all strings are encoded in | 
 | UTF-16, except for conversion from bytes to strings (via InputStreamReader or | 
 | similar) and from strings to bytes (OutputStreamWriter etc.).* | 
 |  | 
 | While most of ICU works with UTF-16 strings and uses data structures optimized | 
 | for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized | 
 | for UTF-8, or work with Unicode code points (21-bit integer values) regardless | 
 | of string encoding. Some data structures are designed to work equally well with | 
 | UTF-16 and UTF-8. | 
 |  | 
 | For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t` | 
 | lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1 | 
 | means NUL-terminated, output is NUL-terminated if there is space, output | 
 | overflow is handled with preflighting; for details see the parent [Strings | 
 | page](index.md).) Some newer APIs take an `icu::StringPiece` argument and write | 
 | to an `icu::ByteSink` or to a string class object like `std::string`. | 
 |  | 
 | ## Conversion Between UTF-8 and UTF-16 | 
 |  | 
 | The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ | 
 | `icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and | 
 | `toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`. | 
 |  | 
 | In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and | 
 | `u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and | 
 | `u_strFromUTF8Lenient()`.) | 
 |  | 
 | The conversion functions in `unicode/ucnv.h` are intended for very flexible | 
 | handling of conversion to/from external byte streams (with customizable error | 
 | handling and support for split buffers at arbitrary boundaries) which is | 
 | normally unnecessary for internal strings. | 
 |  | 
 | Note: `icu::``UnicodeString` has constructors, `setTo()` and `extract()` methods | 
 | which take either a converter object or a charset name. These can be used for | 
 | UTF-8, but are not as efficient or convenient as the | 
 | `fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among | 
 | conversion methods, APIs with a charset name are more convenient but internally | 
 | open and close a converter; ones with a converter object parameter avoid this.) | 
 |  | 
 | ## UTF-8 as Default Charset | 
 |  | 
 | ICU has many functions that take or return `char *` strings that are assumed to | 
 | be in the default charset which should match the system encoding. Since this | 
 | could be one of many charsets, and the charset can be different for different | 
 | processes on the same system, ICU uses its conversion framework for converting | 
 | to and from UTF-16. | 
 |  | 
 | If it is known that the default charset is always UTF-8 on the target platform, | 
 | then you should `#define`` U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`. | 
 | (For example, modify the default value there or pass `-D``U_CHARSET_IS_UTF8=1` | 
 | as a compiler flag.) This will change most of the implementation code to use | 
 | dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the | 
 | conversion framework. (Avoiding such dependencies helps with statically linked | 
 | libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even | 
 | `UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].) | 
 |  | 
 | ## Low-Level UTF-8 String Operations | 
 |  | 
 | `unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16 | 
 | macros in `unicode/utf16.h`. The macros handle many cases inline, but call | 
 | internal functions for complicated parts of the UTF-8 encoding form. For | 
 | example, the following code snippet counts white space characters in a string: | 
 |  | 
 | ```c | 
 | #include "unicode/utypes.h" | 
 | #include "unicode/stringpiece.h" | 
 | #include "unicode/utf8.h" | 
 | #include "unicode/uchar.h" | 
 |  | 
 | int32_t countWhiteSpace(StringPiece sp) { | 
 |     const char *s=sp.data(); | 
 |     int32_t length=sp.length(); | 
 |     int32_t count=0; | 
 |     for(int32_t i=0; i<length;) { | 
 |         UChar32 c; | 
 |         U8_NEXT(s, i, length, c); | 
 |         if(u_isUWhiteSpace(c)) { | 
 |             ++count; | 
 |         } | 
 |     } | 
 |     return count; | 
 | } | 
 | ``` | 
 |  | 
 | ## Dedicated UTF-8 APIs | 
 |  | 
 | ICU has some APIs dedicated for UTF-8. They tend to have been added for "worker | 
 | functions" like comparing strings, to avoid the string conversion overhead, | 
 | rather than for "builder functions" like factory methods and attribute setters. | 
 |  | 
 | For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings | 
 | incrementally, without converting all of the two strings to UTF-16 if there is | 
 | an early base letter difference. | 
 |  | 
 | `ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the | 
 | two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most | 
 | other charsets uses a dedicated, optimized code path, avoiding the pivot through | 
 | UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well, | 
 | but that has not been implemented yet as of ICU 4.4.) | 
 |  | 
 | Other examples: (This list may or may not be complete.) | 
 |  | 
 | *   ucasemap_utf8ToLower(), ucasemap_utf8ToUpper(), ucasemap_utf8ToTitle(), | 
 |     ucasemap_utf8FoldCase() | 
 | *   ucnvsel_selectForUTF8() | 
 | *   icu::UnicodeSet::spanUTF8(), spanBackUTF8() and uset_spanUTF8(), | 
 |     uset_spanBackUTF8() (These are highly optimized for UTF-8 processing.) | 
 | *   ures_getUTF8String(), ures_getUTF8StringByIndex(), ures_getUTF8StringByKey() | 
 | *   uspoof_checkUTF8(), uspoof_areConfusableUTF8(), uspoof_getSkeletonUTF8() | 
 |  | 
 | ## Abstract Text APIs | 
 |  | 
 | ICU offers several interfaces for text access, designed for different use cases. | 
 | (Some interfaces are simply newer and more modern than others.) Some ICU | 
 | services work with some of these interfaces, and for some of these interfaces | 
 | ICU offers UTF-8 implementations out of the box. | 
 |  | 
 | `UText` can be used with `BreakIterator` APIs (character/word/sentence/... | 
 | segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8 | 
 | string. | 
 |  | 
 | *   *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any | 
 |     other charset with non-1:1 index conversion to UTF-16) if no dictionary is | 
 |     supported. This excludes Thai word break. See [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).* | 
 | *   *As a workaround for Thai word breaking, you can convert the string to | 
 |     UTF-16 and convert indexes to UTF-8 string indexes via | 
 |     `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index).`* | 
 | *   *ICU 4.4 has a technology preview for UText in the regular expression API, | 
 |     but some of the UText regex API and semantics are likely to change for ICU | 
 |     4.6. (Especially indexing semantics.)* | 
 |  | 
 | A `UCharIterator` can be used with several collation APIs (although there is | 
 | also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`. | 
 | `uiter_setUTF8()` creates a UCharIterator for a UTF-8 string. | 
 |  | 
 | It is also possible to create a `CharacterIterator` subclass for UTF-8 strings, | 
 | but `CharacterIterator` has a lot of virtual methods and it requires UTF-16 | 
 | string index semantics. |