| --- |
| layout: default |
| title: Unicode Basics |
| nav_order: 3 |
| parent: ICU |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Unicode Basics |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Introduction to Unicode |
| |
| Unicode is a standard that precisely defines a character set as well as a small |
| number of encodings for it. It enables you to handle text in any language |
| efficiently. It allows a single application executable to work for a global |
| audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other |
modern systems, provides internationalization solutions based on Unicode.
| |
| This chapter is intended as an introduction to codepages in general and Unicode |
| in particular. For further information, see: |
| |
| 1. [The Web site of the Unicode consortium](http://www.unicode.org/) |
| |
| 2. [What is |
| Unicode?](https://www.unicode.org/standard/WhatIsUnicode.html) |
| |
| 3. [IBM® Globalization](http://www.ibm.com/software/globalization/) |
| |
| Go to the [online ICU demos](http://demo.icu-project.org/icu-bin/icudemos) to |
| see how a Unicode-based server application can handle text in many languages and |
| many encodings. |
| |
| ## Traditional Character Sets and Unicode |
| |
| Representing text-format data in computers is a matter of defining a set of |
| characters and assigning each of them a number and a bit representation. |
| Underlying this basic idea are three related concepts: |
| |
| 1. A character set or repertoire is an unordered collection of characters that |
| can be represented by numeric values. |
| |
| 2. A coded character set maps characters from a character set or repertoire to |
| numeric values. |
| |
| 3. A character encoding scheme defines the representation of numeric values |
| from one or more coded character sets in bits and bytes. |
| |
| For simple encodings such as ASCII, the last two concepts are basically the |
| same: ASCII assigns 128 characters and control codes to consecutive numbers from |
| 0 to 127. These characters and control codes are encoded as simple, unsigned, |
| binary integers. Therefore, ASCII is both a coded character set and a character |
| encoding scheme. |
| |
| ASCII only encodes 128 characters, 33 of which are control codes rather than |
| graphic, displayable characters. It was designed to represent English-language |
| text for an American user base, and is therefore insufficient for representing |
text in almost any language other than American English. In fact, most
traditional encodings were limited to one or a few languages and scripts.
| |
There was a natural way to extend ASCII: it was designed in the 1960s for
systems with 7-bit bytes, while most computers and Internet protocols since the
1970s use 8-bit bytes. The extra bit allowed another 128 byte values to
represent more characters. Various encodings were developed that supported
different languages. Some of these were based on ASCII, others were not.
| |
Languages such as Japanese need to encode considerably more than 256 characters.
Various encoding schemes enable large character sets with thousands or tens of
thousands of characters to be represented. Most of those encodings are still
byte-based, which means that many characters require two or more bytes of
storage space, and byte values must be interpreted in context to determine
where one character ends and the next begins.
| |
Various character sets and encoding schemes have been developed independently;
they cover only one or a few languages each and are incompatible. This makes it
very
| difficult for a single system to handle text in more than one language at a |
| time, and especially difficult to do so in a way that is interoperable across |
| different systems. |
| |
| Generally, the minimum requirement for the interoperable exchange of text data |
| is that the encoding (character set & encoding scheme) must be properly |
| specified in the document and in the protocol. For example, email/SMTP and |
| HTML/HTTP provide the means to specify the "charset", as it is called in |
Internet standards. However, very often the encoding is not specified, is
specified incorrectly, or the sender and receiver disagree on its implementation.
| |
| The ISO 2022 encoding scheme was created to store text in many different |
| languages. It allows other encodings to be embedded by first announcing them and |
then switching between them. Full support for all features and possible
encodings of ISO 2022 requires complicated processing and support for many
encodings. For East Asian languages, subsets were developed that cover only
one language or a few at a time, and they are much more manageable. ISO 2022
is not well-suited for internal processing; it is designed for data exchange.
| |
| ## Glyphs versus Characters |
| |
| Programmers often need to distinguish between characters and glyphs. A character |
| is the smallest semantic unit in a writing system. It is an abstract concept |
| such as the letter A or the exclamation point. A glyph is the visual |
| presentation of one or more characters, and is often dependent on adjacent |
| characters. |
| |
| There is not always a one-to-one mapping between characters and glyphs. In many |
| languages (Arabic is a prime example), the way a character looks depends heavily |
| on the surrounding characters. Standard printed Arabic has as many as four |
| different printed representations (glyphs) for every letter of the alphabet. In |
many languages, two or more letters may combine into a single glyph (called a
ligature), or a single character might be displayed with more than one glyph.
| |
| Despite the different visual variants of a particular letter, it still retains |
| its identity. For example, the Arabic letter heh has four different visual |
| representations in common use. Whichever one is used, it still keeps its |
| identity as the letter heh. It is this identity that Unicode encodes, not the |
| visual representation. This also cuts down on the number of independent |
| character values required. |
| |
| ## Overview of Unicode |
| |
Unicode was developed as a single coded character set that supports all
languages in the world. The first version of Unicode used 16-bit numbers,
| which allowed for encoding 65,536 characters without complicated multibyte |
| schemes. With the inclusion of more characters, and following implementation |
| needs of many different platforms, Unicode was extended to allow more than one |
| million characters. Several other encoding schemes were added. This introduced |
| more complexity into the Unicode standard, but far less than managing a large |
| number of different encodings. |
| |
Starting with Unicode 2.0 (published in 1996), the Unicode standard began
assigning numbers from 0 to 10FFFF<sub>16</sub>, which requires 21 bits but does not use
them completely. This gives more than enough room for all written languages in
the world. The original repertoire covered all major languages commonly used in
computing; Unicode continues to grow as more scripts are added.
| |
| The design of Unicode differs in several ways from traditional character sets |
| and encoding schemes: |
| |
| 1. Its repertoire enables users to include text efficiently in almost all |
| languages within a single document. |
| |
| 2. It can be encoded in a byte-based way with one or more bytes per character, |
| but the default encoding scheme uses 16-bit units that allow much simpler |
| processing for all common characters. |
| |
| 3. Many characters, such as letters with accents and umlauts, can be combined |
| from the base character and accent or umlaut modifiers. This combining |
| reduces the number of different characters that need to be encoded |
| separately. "Precomposed" variants for characters that existed in common |
| character sets at the time were included for compatibility. |
| |
| 4. Characters and their usage are well-defined and described. While traditional |
| character sets typically only provide the name or a picture of a character |
| and its number and byte encoding, Unicode has a comprehensive database of |
| properties available for download. It also defines a number of processes and |
| algorithms for dealing with many aspects of text processing to make it more |
| interoperable. |
| |
The early inclusion of all characters of commonly used character sets makes
Unicode a useful "pivot" point for converting between traditional character
sets, and makes it feasible to process non-Unicode text by first converting it
into Unicode, processing the text, and then converting it back to the original
encoding without loss of data.
| |
> :point_right: *The first 128 Unicode code point values are assigned to the same characters as
in US-ASCII; that is, the same number is assigned to the same character. The
same is true for the first 256 code point values of Unicode compared to ISO
8859-1 (Latin-1), which is itself a direct superset of US-ASCII. This makes it
easy to adapt many applications to Unicode because the numbers for many
syntactically important characters are the same.*
| |
| ## Character Encoding Forms and Schemes for Unicode |
| |
| Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room |
| to allow for unambiguous encoding of every character in common use. Such a |
| character number is called a "code point". |
| |
| > :point_right: *Unicode code points are just non-negative integer numbers in a certain range. |
| They do not have an implicit binary representation or a width of 21 or 32 bits. |
| Binary representation and unit widths are defined for encoding forms.* |
| |
| For internal processing, the standard defines three encoding forms, and for file |
| storage and protocols, some of these encoding forms have encoding schemes that |
| differ in their byte ordering. The difference between an encoding form and an |
| encoding scheme is that an encoding form maps the character set codes to values |
that fit into internal data types (like a `short` in C), while an encoding scheme
| maps to bits and bytes. For traditional encodings, they are the same since the |
| encoding forms already map to bytes. |
| |
| The different Unicode encoding forms are optimized for a variety of different |
| uses: |
| |
| 1. UTF-16, the default encoding form, maps a character code point to either one |
| or two 16-bit integers. |
| |
| 2. UTF-8 is a byte-based encoding that offers backwards compatibility with |
| ASCII-based, byte-oriented APIs and protocols. A character is stored with 1, |
| 2, 3, or 4 bytes. |
| |
| 3. UTF-32 is the simplest, but most memory-intensive encoding form: It uses one |
| 32-bit integer per Unicode character. |
| |
| 4. SCSU is an encoding scheme that provides a simple compression of Unicode |
| text. It is designed only for input and output, not for internal use. |
| |
| ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters |
| (with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial |
| support for supplementary characters. |
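
To make the differences between the encoding forms concrete, here is a minimal
C sketch that encodes one supplementary character, U+10400 (chosen only as an
example), in each of the three forms, using ICU's macros from `unicode/utf8.h`
and `unicode/utf16.h`:

```c
#include <stdio.h>
#include "unicode/utf8.h"
#include "unicode/utf16.h"

int main(void) {
    UChar32 c = 0x10400; /* a supplementary code point */

    /* UTF-32: always exactly one 32-bit unit, the code point value itself. */
    printf("UTF-32: %08X\n", (unsigned)c);

    /* UTF-16: one or two 16-bit units; supplementary characters need two. */
    UChar units16[U16_MAX_LENGTH];
    int32_t len16 = 0;
    U16_APPEND_UNSAFE(units16, len16, c);
    printf("UTF-16:");
    for (int32_t i = 0; i < len16; ++i) { printf(" %04X", units16[i]); }
    printf("\n"); /* prints: D801 DC00 */

    /* UTF-8: one to four bytes; supplementary characters need four. */
    uint8_t bytes8[U8_MAX_LENGTH];
    int32_t len8 = 0;
    U8_APPEND_UNSAFE(bytes8, len8, c);
    printf("UTF-8: ");
    for (int32_t i = 0; i < len8; ++i) { printf(" %02X", bytes8[i]); }
    printf("\n"); /* prints: F0 90 90 80 */
    return 0;
}
```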
| |
| For input/output, character encoding schemes define a byte serialization of |
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
| byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one |
| that serializes the code units in big-endian byte order (most significant byte |
| first), and one that serializes the code units in little-endian byte order |
| (least significant byte first). The corresponding encoding schemes are called |
| UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. |
| |
| > :point_right: *The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they refer |
| either to character encoding forms where 16/32-bit words are processed and are |
| naturally stored in the platform endianness, or they refer to the |
| IANA-registered charset names, i.e., to character encoding schemes or byte |
| serializations. In addition to simple byte serialization, the charsets with |
| these names also use optional Byte Order Marks (see [Serialized Formats](#serialized-formats) below).* |
| |
| ## Overview of UTF-16 |
| |
| The default encoding form of the Unicode Standard uses 16-bit code units. Code |
| point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and |
| are encoded with just one 16-bit unit of the same value. Code points from |
| 10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called |
| "surrogates", and they are called a "surrogate pair" when, together, they |
| correctly encode one Unicode character. The first surrogate in a pair must be in |
| the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to |
| DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with |
| either one code unit that is not a surrogate or with a correct pair of |
| surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this |
| mechanism and will never, by themselves, be assigned any characters. |
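
The surrogate values can be computed with simple shifts and masks. Here is a
minimal C sketch using ICU's `U16_LEAD`, `U16_TRAIL`, and
`U16_GET_SUPPLEMENTARY` macros from `unicode/utf16.h`; the code point is an
arbitrary example:

```c
#include <stdio.h>
#include "unicode/utf16.h"

int main(void) {
    UChar32 c = 0x1D11E; /* an arbitrary supplementary code point */

    /* lead  = 0xD800 + ((c - 0x10000) >> 10)
       trail = 0xDC00 + ((c - 0x10000) & 0x3FF) */
    UChar lead = U16_LEAD(c);   /* 0xD834 */
    UChar trail = U16_TRAIL(c); /* 0xDD1E */
    printf("U+%04X encodes as %04X %04X\n", (unsigned)c, lead, trail);

    /* Recombining the pair recovers the original code point. */
    UChar32 back = U16_GET_SUPPLEMENTARY(lead, trail);
    printf("decoded back to U+%04X\n", (unsigned)back);
    return 0;
}
```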
| |
| Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1 |
| assigns more than 40,000 supplementary characters that make use of surrogate |
| pairs in UTF-16. |
| |
Note that comparing UTF-16 strings lexically based on their 16-bit code units
does not result in the same order as comparing the code points. This is not
usually an issue since only rarely used characters are affected, and most
processes do not depend on the two orders being the same. Where necessary, a
simple modification to a string comparison can be performed that still allows
efficient code unit-based comparisons and makes them compatible with code point
comparisons. ICU has C and C++ API functions for this, as shown below.
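
One such function is `u_strCompare()` from `unicode/ustring.h`, which takes a
`codePointOrder` flag. A minimal sketch, with strings contrived purely to
expose the ordering difference:

```c
#include <stdio.h>
#include "unicode/ustring.h"

int main(void) {
    /* s1 is U+FF21; s2 is U+10000, stored as the surrogate pair D800 DC00. */
    static const UChar s1[] = { 0xFF21, 0 };
    static const UChar s2[] = { 0xD800, 0xDC00, 0 };

    /* Code unit order: the first units are 0xFF21 vs. 0xD800, so s1 > s2. */
    int32_t units = u_strCompare(s1, -1, s2, -1, /* codePointOrder= */ 0);

    /* Code point order: U+FF21 < U+10000, so s1 < s2. */
    int32_t points = u_strCompare(s1, -1, s2, -1, /* codePointOrder= */ 1);

    printf("code unit order: %d, code point order: %d\n",
           (int)units, (int)points); /* positive, then negative */
    return 0;
}
```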
| |
| ## Overview of UTF-8 |
| |
| To meet the requirements of byte-oriented, ASCII-based systems, the Unicode |
| Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that |
| preserves ASCII transparency. |
| |
| UTF-8 maintains transparency for all the ASCII code values (0..127). These |
| values do not appear in any byte of a transformed result except as the direct |
| representation of the ASCII values. Thus, ASCII text is also UTF-8 text. |
| |
| Characteristics of UTF-8 include: |
| |
| 1. Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the |
| same value. Therefore, ASCII characters take up 50% less space with UTF-8 |
| encoding than with UTF-16. |
| |
| 2. All other code points are encoded with multibyte sequences, with the first |
| byte (lead byte) indicating the number of bytes that follow (trail bytes). |
   This results in very efficient parsing. The lead bytes are in the range C0<sub>16</sub>
   to FD<sub>16</sub>, the trail bytes are in the range 80<sub>16</sub> to BF<sub>16</sub>. The byte values FE<sub>16</sub>
   and FF<sub>16</sub> are never used.
| |
| 3. UTF-8 is relatively compact and resource conservative in its use of the |
| bytes required for encoding text in European scripts, but uses 50% more |
| space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two |
| bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16), |
| and all others four. |
| |
| 4. Binary comparisons of UTF-8 strings based on their bytes result in the same |
| order as comparing code point values. |
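
These storage costs can be checked with ICU's `U8_LENGTH` and `U16_LENGTH`
macros. A minimal C sketch with one sample code point from each size class
mentioned above (ASCII, Greek, CJK, and a supplementary character):

```c
#include <stdio.h>
#include "unicode/utf8.h"
#include "unicode/utf16.h"

int main(void) {
    /* One code point from each size class discussed above. */
    static const UChar32 samples[] = { 0x41, 0x3B1, 0x4E2D, 0x10400 };
    for (int i = 0; i < 4; ++i) {
        UChar32 c = samples[i];
        printf("U+%04X: %d UTF-8 byte(s), %d UTF-16 unit(s), 1 UTF-32 unit\n",
               (unsigned)c, U8_LENGTH(c), U16_LENGTH(c));
    }
    return 0;
}
```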
| |
| ## Overview of UTF-32 |
| |
| The UTF-32 encoding form always uses one single 32-bit integer per Unicode code |
| point. This results in a very simple encoding. |
| |
| The drawback is its memory consumption: Since code point values use only 21 |
| bits, one-third of the memory is always unused, and since most commonly used |
| characters have code point values of up to FFFF<sub>16</sub>, they take up only one 16-bit |
| unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less). |
| |
| UTF-32 is mainly used in APIs that are defined with the same data type for both |
| code points and code units. Modern versions of the C standard library that |
| support Unicode use a 32-bit `wchar_t` with UTF-32 semantics. |
| |
| ## Overview of SCSU |
| |
| SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of |
| Unicode text for both input and output. It is a simple compression that |
| transforms the text into a byte stream. It typically uses one byte per character |
| in small scripts, and two bytes per character in large, East Asian scripts. |
| |
| It is usually shorter than any of the UTFs. However, SCSU is stateful, which |
| makes it unsuitable for internal processing. It also uses all possible byte |
| values, which might require additional processing for protocols such as SMTP |
| (email). |
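
In ICU, SCSU is available like any other converter. A minimal C sketch that
compresses a short Cyrillic string, assuming the converter name "SCSU"; error
handling is abbreviated:

```c
#include <stdio.h>
#include "unicode/ucnv.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("SCSU", &status);

    /* Four Cyrillic characters: eight bytes in UTF-16. */
    static const UChar text[] = { 0x042F, 0x0437, 0x044B, 0x043A, 0 };
    char compressed[64];
    int32_t length = ucnv_fromUChars(cnv, compressed,
                                     (int32_t)sizeof(compressed), text, -1,
                                     &status);

    if (U_SUCCESS(status)) {
        /* SCSU switches to a Cyrillic window once, then uses one byte
           per character, so the output is much shorter than UTF-16. */
        printf("SCSU output: %d bytes\n", (int)length);
    }
    if (cnv != NULL) { ucnv_close(cnv); }
    return 0;
}
```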
| |
See also <https://www.unicode.org/reports/tr6/>.
| |
| ## Other Unicode Encodings |
| |
Other Unicode encodings have been developed over time for various purposes. Most
of them are implemented in ICU; see
[source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt).
| |
| 1. BOCU-1: Binary-Ordered Compression of Unicode |
| An encoding of Unicode that is about as compact as SCSU but has a much |
| smaller amount of state. Unlike SCSU, it preserves code point order and can |
   be used in 8-bit emails without a transfer encoding. BOCU-1 does **not**
   preserve ASCII characters in ASCII-readable form. See [Unicode Technical
   Note #6](http://www.unicode.org/notes/tn6/).
| |
2. UTF-7: Designed for 7-bit emails; simple and not very compact. Since email
   systems have been 8-bit safe for several years, UTF-7 is no longer necessary
   and is not recommended. Most ASCII characters are readable, others are
   base64-encoded. See [RFC 2152](http://www.ietf.org/rfc/rfc2152.txt).
| |
| 3. IMAP-mailbox-name: A variant of UTF-7 that is suitable for expressing |
| Unicode strings as ASCII characters for Unix filenames. |
| **The name "IMAP-mailbox-name" is specific to ICU!** |
   See [RFC 2060 INTERNET MESSAGE ACCESS PROTOCOL - VERSION
   4rev1](http://www.ietf.org/rfc/rfc2060.txt), section 5.1.3, Mailbox
   International Naming Convention.
| |
| 4. UTF-EBCDIC: An EBCDIC-friendly encoding that is similar to UTF-8. See |
   [Unicode Technical Report #16](http://www.unicode.org/reports/tr16/). **As
   of ICU 2.6, UTF-EBCDIC is not implemented in ICU.**
| |
| 5. CESU-8: Compatibility Encoding Scheme for UTF-16: 8-Bit |
| An incompatible variant of UTF-8 that preserves 16-bit-Unicode (UTF-16) |
| string order instead of code point order. Not for open interchange. See |
   [Unicode Technical Report #26](http://www.unicode.org/reports/tr26/).
| |
| ## Programming using UTFs |
| |
| Programming using any of the UTFs is much more straightforward than with |
| traditional multi-byte character encodings, even though UTF-8 and UTF-16 are |
| also variable-width encodings. |
| |
Within each Unicode encoding form, the code unit values for singletons (code
units that alone encode characters), lead units, and trail units are all
disjoint. This has crucial implications for implementations:
| |
| 1. Determines the number of units for one code point using the lead unit. This |
| is especially important for UTF-8, where there can be up to 4 bytes per |
| character. |
| |
2. Determines boundaries. If ICU users randomly access text, they can always
   determine the nearest code-point boundaries with a small number of machine
   instructions.
| |
3. Does not have any overlap. If ICU users search for string A in string B, they
   never get a false match on code points. Users do not need to convert to code
   points for string searching. False matches never occur since the end of one
   sequence is never the same as the start of another sequence. Overlap is one
   of the biggest problems with common multi-byte encodings like Shift-JIS. All
   the UTFs avoid this problem.
| |
4. Uses simple iteration. Getting the next or previous code point is
   straightforward, and only takes a small number of machine instructions (see
   the sketch following this list).
| |
| 5. Can use UTF-16 encoding, which is actually fully symmetric. ICU users can |
| determine from any single code unit whether it is the first, last, or only |
| one for a code point. Moving (iterating) in either direction through UTF-16 |
| text is equally fast and efficient. |
| |
| 6. Uses slow indexing by code points. This indexing procedure is a disadvantage |
| of all variable-width encodings. Except in UTF-32, it is inefficient to find |
| code unit boundaries corresponding to the nth code point or to find the code |
| point offset containing the nth code unit. Both involve scanning from the |
| start of the text or from a last known boundary. ICU, like most common APIs, |
| always indexes by code units. It counts code units and not code points. |
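
As an illustration of points 2 and 4, here is a minimal C sketch of code point
iteration over a UTF-16 string, using the `U16_NEXT` and `U16_PREV` macros from
`unicode/utf16.h`; the sample string is arbitrary:

```c
#include <stdio.h>
#include "unicode/utf16.h"

int main(void) {
    /* "A", then U+10400 as a surrogate pair, then "B". */
    static const UChar s[] = { 0x0041, 0xD801, 0xDC00, 0x0042 };
    const int32_t length = 4;
    UChar32 c;

    /* Forward: U16_NEXT reads one code point, advancing i by 1 or 2 units. */
    int32_t i = 0;
    while (i < length) {
        U16_NEXT(s, i, length, c);
        printf("forward:  U+%04X\n", (unsigned)c);
    }

    /* Backward iteration is just as simple because lead and trail
       surrogate values are disjoint. */
    i = length;
    while (i > 0) {
        U16_PREV(s, 0, i, c);
        printf("backward: U+%04X\n", (unsigned)c);
    }
    return 0;
}
```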
| |
| Conversion between different UTFs is very fast. Unlike converting to and from |
| legacy encodings like Latin-2, conversion between UTFs does not require table |
| look-ups. |
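
For example, `u_strToUTF8()` and its counterpart `u_strFromUTF8()` in
`unicode/ustring.h` convert directly between UTF-16 and UTF-8. A minimal
sketch:

```c
#include <stdio.h>
#include "unicode/ustring.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* "Hi " followed by U+10400 as a surrogate pair. */
    static const UChar utf16[] = { 0x0048, 0x0069, 0x0020, 0xD801, 0xDC00, 0 };

    char utf8[32];
    int32_t utf8Length = 0;
    u_strToUTF8(utf8, (int32_t)sizeof(utf8), &utf8Length, utf16, -1, &status);

    if (U_SUCCESS(status)) {
        /* Three single-byte characters plus one four-byte character. */
        printf("UTF-8 length: %d bytes\n", (int)utf8Length); /* prints 7 */
    }
    return 0;
}
```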
| |
| ICU provides two basic data type definitions for Unicode. `UChar32` is a 32-bit |
type for code points, and is used for single Unicode characters. It may be signed
| or unsigned. It is the same as `wchar_t` if it is 32 bits wide. `UChar` is an |
| unsigned 16-bit integer for UTF-16 code units. It is the base type for strings |
| (`UChar *`), and it is the same as `wchar_t` if it is 16 bits wide. |
| |
Some higher-level APIs, used especially for formatting, use characters closer to
a representation of a glyph. Such "user characters" are also called "graphemes"
or "grapheme clusters" and require strings so that combining sequences can be
included; a sketch of grapheme cluster iteration follows below.
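
Grapheme cluster boundaries are available through ICU's break iterator C API in
`unicode/ubrk.h`. A minimal sketch, using "e" plus a combining acute accent
(two code points that form one user character) as sample text:

```c
#include <stdio.h>
#include "unicode/ubrk.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* "e" + U+0301 COMBINING ACUTE ACCENT + "x": three code points,
       but only two user characters. */
    static const UChar text[] = { 0x0065, 0x0301, 0x0078 };

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text, 3, &status);
    if (U_SUCCESS(status)) {
        int32_t start = ubrk_first(bi);
        int32_t end;
        while ((end = ubrk_next(bi)) != UBRK_DONE) {
            printf("grapheme cluster at code units [%d..%d)\n",
                   (int)start, (int)end);
            start = end;
        }
        ubrk_close(bi);
    }
    return 0;
}
```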
| |
| ## Serialized Formats |
| |
| In files, input, output, and network protocols, text must be accompanied by the |
| specification of its character encoding scheme for a client to be able to |
| interpret it correctly. (This is called a "charset" in Internet protocols.) |
| However, an encoding scheme specification is not necessary if the text is only |
| used within a single platform, protocol, or application where it is otherwise |
| clear what the encoding is. (The language and text directionality should usually |
| be specified to enable spell checking, text-to-speech transformation, etc.) |
| |
| *The discussion of encoding specifications in this section applies to standard |
| Internet protocols where charset name strings are used. Other protocols may use |
| numeric encoding identifiers and assign different semantics to those identifiers |
| than Internet protocols.* |
| |
| Typically, the encoding specification is done in a protocol- and document |
| format-dependent way. However, the Unicode standard offers a mechanism for |
| tagging text files with a "signature" for cases where protocols do not identify |
| character encoding schemes. |
| |
| The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by |
| prepending it to a file or stream. The alternative function of U+FEFF as a |
| format control character has been copied to U+2060 WORD JOINER, and U+FEFF |
| should only be used for Unicode signatures. |
| |
| The different character encoding schemes generate different, distinct byte |
| sequences for U+FEFF: |
| |
| 1. UTF-8: EF BB BF |
| |
| 2. UTF-16BE: FE FF |
| |
| 3. UTF-16LE: FF FE |
| |
| 4. UTF-32BE: 00 00 FE FF |
| |
| 5. UTF-32LE: FF FE 00 00 |
| |
| 6. SCSU: 0E FE FF |
| |
| 7. BOCU-1: FB EE 28 |
| |
| 8. UTF-7: 2B 2F 76 ( 38 | 39 | 2B | 2F ) |
| |
| 9. UTF-EBCDIC: DD 73 66 73 |
| |
| ICU provides the function `ucnv_detectUnicodeSignature()` for Unicode signature |
| detection. |
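
A minimal C sketch of its use; the byte array is a contrived example beginning
with the UTF-8 signature, and error handling is abbreviated:

```c
#include <stdio.h>
#include "unicode/ucnv.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* A UTF-8 signature (EF BB BF) followed by the text "Hi". */
    static const char bytes[] = "\xEF\xBB\xBF" "Hi";

    int32_t signatureLength = 0;
    const char *charset = ucnv_detectUnicodeSignature(
        bytes, (int32_t)(sizeof(bytes) - 1), &signatureLength, &status);

    if (U_SUCCESS(status) && charset != NULL) {
        /* charset is "UTF-8" here; as described below, the signature
           should be removed after conversion, not before. */
        printf("detected %s, signature length %d\n",
               charset, (int)signatureLength);
    }
    return 0;
}
```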
| |
| *There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and |
| CESU-8 encode U+FEFF and in fact all BMP code points with the same bytes. The |
| opportunity for misidentification of one as the other is one of the reasons why |
| CESU-8 should only be used in limited, closed, specific environments.* |
| |
| In UTF-16 and UTF-32, where the signature also distinguishes between big-endian |
| and little-endian byte orders, it is also called a byte order mark (BOM). The |
| signature works for UTF-16 since the code point that has the byte-swapped |
| encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a |
| "non-character" code point.) In Internet protocols, if an encoding specification |
| of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte |
| sequence (BOM) that identifies the byte ordering, which is not the case for the |
| encoding scheme/charset names with "BE" or "LE". |
| |
| *If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not |
| begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE, |
| respectively.* |
| |
| A signature is not part of the content, and must be stripped when processing. |
| For example, blindly concatenating two files will give an incorrect result. |
| |
| If a signature was detected, then the signature "character" U+FEFF should be |
| removed from the Unicode stream **after** conversion. Removing the signature |
| bytes before conversion could cause the conversion to fail for stateful |
| encodings like BOCU-1 and UTF-7. |
| |
| Whether a signature is to be recognized or not depends on the protocol or |
| application. |
| |
| 1. If a protocol specifies a charset name, then the byte stream must be |
| interpreted according to how that name is defined. Only the "UTF-16" and |
| "UTF-32" names include recognition of the byte order marks that are specific |
| to them (and the ICU converters for these names do this automatically). None |
| of the other Unicode charsets are defined to include any signature/BOM |
| handling. |
| |
| 2. If no charset name is provided, for example for text files in most |
| filesystems, then applications must usually rely on heuristics to determine |
| the file encoding. Many document formats contain an embedded or implicit |
| encoding declaration, but for plain text files it is reasonable to use |
| Unicode signatures as simple and reliable heuristics. This is especially |
| common on Windows systems. However, some tools for plain text file handling |
| (e.g., many Unix command line tools) are not prepared for Unicode |
| signatures. |
| |
| ## The Unicode Standard Is An Industry Standard |
| |
| The Unicode standard is an industry standard and parallels ISO 10646-1. Around |
| 1993, these two standards were effectively merged into the same character set |
| standard. Both standards have the same character repertoire and the same |
| encoding forms and schemes. |
| |
| One difference used to be that the ISO standard defined code point values to be |
| from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add |
| an amendment to the standard. The amendment removes this difference by declaring |
| that no characters will ever be assigned code points above 10FFFF<sub>16</sub>. The main |
| reason for the ISO work group's decision is interoperability between the UTFs. |
UTF-16 cannot encode any code points above this limit.
| |
| This means that the code point space for both Unicode and ISO 10646 is now the |
| same! **These changes to ISO 10646 have been made recently and should be |
| complete in the edition ISO 10646:2003 which also combines all parts of the |
| standard into one.** |
| |
| The former, larger code space is the reason why the ISO definition of UTF-8 |
| specifies sequences of five and six bytes to cover that whole range. |
| |
| Another difference is that the ISO standard defines encoding forms "UCS-4" and |
| "UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of |
| 7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee |
has accepted that characters above 10FFFF<sub>16</sub> will not be encoded, so there is
| essentially no difference between the forms. The "4" stands for "four-byte |
| form". |
| |
UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF<sub>16</sub>,
excluding the surrogate code points. Thus, it cannot represent the characters
with code points above FFFF<sub>16</sub> (called supplementary characters).
| |
| *There is no conversion necessary between UCS-2 and UTF-16. The difference is |
| only in the interpretation of surrogates.* |
| |
| The standards differ in what kind of information they provide: The Unicode |
| standard provides more character properties and describes algorithms etc., while |
| the ISO standard defines collections, subsets and similar. |
| |
| The standards are synchronized, and the respective committees work together to |
| add new characters and assign code point values. |