---
layout: default
title: UTF-8
nav_order: 1
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# UTF-8
*Note: This page is only relevant for C/C++. In Java, all strings are encoded in
UTF-16, except for conversion from bytes to strings (via InputStreamReader or
similar) and from strings to bytes (OutputStreamWriter etc.).*

While most of ICU works with UTF-16 strings and uses data structures optimized
for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
for UTF-8, or work with Unicode code points (21-bit integer values) regardless
of string encoding. Some data structures are designed to work equally well with
UTF-16 and UTF-8.

For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1
means NUL-terminated, output is NUL-terminated if there is space, output
overflow is handled with preflighting; for details see the parent [Strings
page](index.md).) Some newer APIs take an `icu::StringPiece` argument and write
to an `icu::ByteSink` or to a string class object like `std::string`.

## Conversion Between UTF-8 and UTF-16
The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
`icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and
`toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.
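
For example, here is a minimal sketch that runs UTF-16-based processing
(default-locale uppercasing, chosen only for illustration) over UTF-8 input;
the function name is illustrative:

```c++
#include <string>
#include "unicode/stringpiece.h"
#include "unicode/unistr.h"

// Convert UTF-8 to a UnicodeString (UTF-16), apply a UTF-16 API, and
// convert the result back to UTF-8.
std::string toUpperUTF8(const std::string &utf8In) {
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(utf8In);  // UTF-8 -> UTF-16
    ustr.toUpper();                        // any UnicodeString/UTF-16 processing
    std::string utf8Out;
    ustr.toUTF8String(utf8Out);            // UTF-16 -> UTF-8, appended to utf8Out
    return utf8Out;
}
```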
In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and
`u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and
`u_strFromUTF8Lenient()`.)
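
A sketch of the C API, assuming a caller-provided UTF-16 buffer (the helper
function name is illustrative):

```c++
#include "unicode/utypes.h"
#include "unicode/ustring.h"

// Convert a NUL-terminated UTF-8 string into a UTF-16 buffer, substituting
// U+FFFD for ill-formed sequences instead of failing.
int32_t utf8ToUTF16(const char *utf8, UChar *dest, int32_t destCapacity,
                    UErrorCode *pErrorCode) {
    int32_t destLength = 0;
    int32_t numSubstitutions = 0;
    u_strFromUTF8WithSub(dest, destCapacity, &destLength,
                         utf8, -1,  // length=-1: NUL-terminated input
                         0xFFFD, &numSubstitutions, pErrorCode);
    return destLength;  // UTF-16 length; preflighted if the buffer was too small
}
```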
The conversion functions in `unicode/ucnv.h` are intended for very flexible
handling of conversion to/from external byte streams (with customizable error
handling and support for split buffers at arbitrary boundaries) which is
normally unnecessary for internal strings.

Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods
which take either a converter object or a charset name. These can be used for
UTF-8, but are not as efficient or convenient as the
`fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among
conversion methods, APIs with a charset name are more convenient but internally
open and close a converter; ones with a converter object parameter avoid this.)

## UTF-8 as Default Charset
ICU has many functions that take or return `char *` strings that are assumed to
be in the default charset which should match the system encoding. Since this
could be one of many charsets, and the charset can be different for different
processes on the same system, ICU uses its conversion framework for converting
to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform,
then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`.
(For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
as a compiler flag.) This will change most of the implementation code to use
dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
conversion framework. (Avoiding such dependencies helps with statically linked
libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even
`UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)
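
As a sketch, code that relies on this setting can assert it at compile time
(the error message is illustrative):

```c++
#include "unicode/utypes.h"

// Fail the build if U_CHARSET_IS_UTF8 was not set to 1, for example because
// the -D flag was dropped from the compiler invocation.
#if !U_CHARSET_IS_UTF8
#  error "This code requires the ICU default charset to be UTF-8 (U_CHARSET_IS_UTF8=1)."
#endif
```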
## Low-Level UTF-8 String Operations
`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
macros in `unicode/utf16.h`. The macros handle many cases inline, but call
internal functions for complicated parts of the UTF-8 encoding form. For
example, the following code snippet counts white space characters in a string:

```c++
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

int32_t countWhiteSpace(StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i<length;) {
        UChar32 c;
        U8_NEXT(s, i, length, c);  // reads one code point, advancing i by 1..4 bytes; c<0 for ill-formed UTF-8
        if(u_isUWhiteSpace(c)) {
            ++count;
        }
    }
    return count;
}
```
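
The macros can also be used to build UTF-8 strings. A minimal sketch using
`U8_APPEND` with a caller-provided `uint8_t` buffer (the helper function is
illustrative):

```c++
#include "unicode/utypes.h"
#include "unicode/utf8.h"

// Encode a sequence of code points as UTF-8. Returns the number of bytes
// written, or -1 if the buffer is too small or a code point is not
// encodable (for example, an unpaired surrogate).
int32_t codePointsToUTF8(const UChar32 *codePoints, int32_t count,
                         uint8_t *dest, int32_t destCapacity) {
    int32_t i = 0;
    UBool isError = false;
    for (int32_t j = 0; j < count && !isError; ++j) {
        U8_APPEND(dest, i, destCapacity, codePoints[j], isError);  // sets isError on failure
    }
    return isError ? -1 : i;
}
```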
## Dedicated UTF-8 APIs
ICU has some APIs dedicated to UTF-8. They tend to have been added for "worker
functions" like comparing strings, to avoid the string conversion overhead,
rather than for "builder functions" like factory methods and attribute setters.

For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
incrementally, without converting both strings entirely to UTF-16 if there is
an early base letter difference.
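
A minimal sketch, treating the locale and the helper function name as
examples only:

```c++
#include <memory>
#include "unicode/coll.h"
#include "unicode/locid.h"
#include "unicode/stringpiece.h"

// Compare two UTF-8 strings with locale-sensitive collation, without
// necessarily converting both strings entirely to UTF-16.
bool lessThanUTF8(icu::StringPiece a, icu::StringPiece b, UErrorCode &errorCode) {
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale::getFrench(), errorCode));
    if (U_FAILURE(errorCode)) { return false; }
    return coll->compareUTF8(a, b, errorCode) == UCOL_LESS;
}
```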
`ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most
other charsets uses a dedicated, optimized code path, avoiding the pivot through
UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well,
but that has not been implemented yet as of ICU 4.4.)
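
A minimal sketch, assuming a Shift_JIS target converter and a caller-provided
output buffer (names and buffer sizes are illustrative):

```c++
#include "unicode/utypes.h"
#include "unicode/ucnv.h"

// Convert UTF-8 bytes directly to Shift_JIS with ucnv_convertEx(); the UTF-8
// source side takes the optimized code path described above. On overflow,
// *pErrorCode is set to U_BUFFER_OVERFLOW_ERROR.
int32_t utf8ToShiftJIS(const char *utf8, int32_t utf8Length,
                       char *dest, int32_t destCapacity, UErrorCode *pErrorCode) {
    UConverter *utf8Cnv = ucnv_open("UTF-8", pErrorCode);
    UConverter *sjisCnv = ucnv_open("Shift_JIS", pErrorCode);
    char *target = dest;
    if (U_SUCCESS(*pErrorCode)) {
        const char *source = utf8;
        UChar pivotBuffer[64];  // used only if no direct conversion path is available
        UChar *pivotSource = pivotBuffer, *pivotTarget = pivotBuffer;
        ucnv_convertEx(sjisCnv, utf8Cnv,
                       &target, dest + destCapacity,
                       &source, utf8 + utf8Length,
                       pivotBuffer, &pivotSource, &pivotTarget, pivotBuffer + 64,
                       true /* reset */, true /* flush */, pErrorCode);
    }
    ucnv_close(utf8Cnv);
    ucnv_close(sjisCnv);
    return (int32_t)(target - dest);  // number of bytes written
}
```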
Other examples (this list may not be complete):

* `ucasemap_utf8ToLower()`, `ucasemap_utf8ToUpper()`, `ucasemap_utf8ToTitle()`,
  `ucasemap_utf8FoldCase()`
* `ucnvsel_selectForUTF8()`
* `icu::UnicodeSet::spanUTF8()`, `spanBackUTF8()` and `uset_spanUTF8()`,
  `uset_spanBackUTF8()` (These are highly optimized for UTF-8 processing; see
  the sketch after this list.)
* `ures_getUTF8String()`, `ures_getUTF8StringByIndex()`, `ures_getUTF8StringByKey()`
* `uspoof_checkUTF8()`, `uspoof_areConfusableUTF8()`, `uspoof_getSkeletonUTF8()`
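
As an example of the span functions, here is a minimal sketch that measures
the leading run of characters contained in a set; the pattern and the function
name are illustrative:

```c++
#include "unicode/utypes.h"
#include "unicode/unistr.h"
#include "unicode/uniset.h"
#include "unicode/uset.h"

// Return the length in bytes of the longest initial substring of the UTF-8
// string that consists only of ASCII digits.
int32_t spanLeadingDigits(const char *utf8, int32_t length, UErrorCode &errorCode) {
    icu::UnicodeSet digits(UNICODE_STRING_SIMPLE("[0-9]"), errorCode);
    if (U_FAILURE(errorCode)) { return 0; }
    digits.freeze();  // a frozen set uses the optimized UTF-8 span code
    return digits.spanUTF8(utf8, length, USET_SPAN_CONTAINED);
}
```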
## Abstract Text APIs
ICU offers several interfaces for text access, designed for different use cases.
(Some interfaces are simply newer and more modern than others.) Some ICU
services work with some of these interfaces, and for some of these interfaces
ICU offers UTF-8 implementations out of the box.

`UText` can be used with `BreakIterator` APIs (character/word/sentence/...
segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
string.

* *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
  other charset with non-1:1 index conversion to UTF-16) if no dictionary is
  used. This excludes Thai word break. See [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).*
* *As a workaround for Thai word breaking, you can convert the string to
  UTF-16 and convert indexes to UTF-8 string indexes via
  `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index)`.*
* *ICU 4.4 has a technology preview for UText in the regular expression API,
  but some of the UText regex API and semantics are likely to change for ICU
  4.6. (Especially indexing semantics.)*
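
A minimal sketch of UTF-8 word segmentation via `UText` (the locale and helper
function name are illustrative, and error handling is kept minimal):

```c++
#include <memory>
#include "unicode/utypes.h"
#include "unicode/brkiter.h"
#include "unicode/locid.h"
#include "unicode/utext.h"

// Iterate over word boundaries in a UTF-8 string without converting it to
// UTF-16. The reported boundaries are UTF-8 byte indexes.
int32_t countWordBoundariesUTF8(const char *utf8, int32_t length, UErrorCode &errorCode) {
    UText ut = UTEXT_INITIALIZER;
    utext_openUTF8(&ut, utf8, length, &errorCode);
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createWordInstance(icu::Locale::getEnglish(), errorCode));
    int32_t count = 0;
    if (U_SUCCESS(errorCode)) {
        bi->setText(&ut, errorCode);
        for (int32_t boundary = bi->first(); boundary != icu::BreakIterator::DONE;
             boundary = bi->next()) {
            ++count;
        }
    }
    utext_close(&ut);
    return count;
}
```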
A `UCharIterator` can be used with several collation APIs (although there is
also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`.
`uiter_setUTF8()` creates a UCharIterator for a UTF-8 string.
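
A minimal sketch comparing a UTF-8 string with a UTF-16 string in code point
order, without converting either one (the function name is illustrative):

```c++
#include "unicode/utypes.h"
#include "unicode/uiter.h"
#include "unicode/ustring.h"

// Returns <0, 0, or >0, like strcmp, comparing by code point order.
int32_t compareUTF8WithUTF16(const char *utf8, int32_t utf8Length,
                             const UChar *utf16, int32_t utf16Length) {
    UCharIterator utf8Iter, utf16Iter;
    uiter_setUTF8(&utf8Iter, utf8, utf8Length);
    uiter_setString(&utf16Iter, utf16, utf16Length);
    return u_strCompareIter(&utf8Iter, &utf16Iter, true /* code point order */);
}
```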
It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
but `CharacterIterator` has a lot of virtual methods and it requires UTF-16
string index semantics.