| --- |
| layout: default |
| title: Unicode Basics |
| nav_order: 3 |
| parent: ICU |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Unicode Basics |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Introduction to Unicode |
| |
| Unicode is a standard that precisely defines a character set as well as a small |
| number of encodings for it. It enables you to handle text in any language |
| efficiently. It allows a single application executable to work for a global |
| audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other |
modern systems, provides internationalization solutions based on Unicode.
| |
| This chapter is intended as an introduction to codepages in general and Unicode |
| in particular. For further information, see: |
| |
| 1. [The Web site of the Unicode consortium](http://www.unicode.org/) |
| |
| 2. [What is |
| Unicode?](https://www.unicode.org/standard/WhatIsUnicode.html) |
| |
| 3. [IBM® Globalization](http://www.ibm.com/software/globalization/) |
| |
| Go to the [online ICU demos](http://demo.icu-project.org/icu-bin/icudemos) to |
| see how a Unicode-based server application can handle text in many languages and |
| many encodings. |
| |
| ## Traditional Character Sets and Unicode |
| |
| Representing text-format data in computers is a matter of defining a set of |
| characters and assigning each of them a number and a bit representation. |
| Underlying this basic idea are three related concepts: |
| |
| 1. A character set or repertoire is an unordered collection of characters that |
| can be represented by numeric values. |
| |
| 2. A coded character set maps characters from a character set or repertoire to |
| numeric values. |
| |
| 3. A character encoding scheme defines the representation of numeric values |
| from one or more coded character sets in bits and bytes. |
| |
| For simple encodings such as ASCII, the last two concepts are basically the |
| same: ASCII assigns 128 characters and control codes to consecutive numbers from |
| 0 to 127. These characters and control codes are encoded as simple, unsigned, |
| binary integers. Therefore, ASCII is both a coded character set and a character |
| encoding scheme. |
| |
| ASCII only encodes 128 characters, 33 of which are control codes rather than |
| graphic, displayable characters. It was designed to represent English-language |
| text for an American user base, and is therefore insufficient for representing |
text in almost any language other than American English. In fact, most
traditional encodings were limited to one or a few languages and scripts.
| |
There was a natural way to extend ASCII: it was designed in the 1960s for
systems with 7-bit bytes, while most computers and Internet protocols since the
1970s use 8-bit bytes. The extra bit allowed another 128 byte values to
represent more characters. Various encodings were developed that supported
different languages. Some of these were based on ASCII, others were not.
| |
Languages such as Japanese need to encode considerably more than 256 characters.
Various encoding schemes enable large character sets with thousands or tens of
thousands of characters to be represented. Most of those encodings are still
byte-based, which means that many characters require two or more bytes of
storage space, and byte values must be interpreted in context to determine
where one character ends and the next begins.
| |
Various character sets and encoding schemes have been developed independently;
they cover only one or a few languages each and are incompatible. This makes it
very
| difficult for a single system to handle text in more than one language at a |
| time, and especially difficult to do so in a way that is interoperable across |
| different systems. |
| |
| Generally, the minimum requirement for the interoperable exchange of text data |
| is that the encoding (character set & encoding scheme) must be properly |
| specified in the document and in the protocol. For example, email/SMTP and |
| HTML/HTTP provide the means to specify the "charset", as it is called in |
Internet standards. However, very often the encoding is not specified, is
specified incorrectly, or the sender and receiver disagree on its implementation.
| |
| The ISO 2022 encoding scheme was created to store text in many different |
| languages. It allows other encodings to be embedded by first announcing them and |
then switching between them. Full support for all features and possible
encodings of ISO 2022 requires complicated processing and support for many
encodings. For East Asian languages, subsets were developed that cover only
one language or a few at a time, and they are much more manageable. ISO 2022
is not well-suited for internal processing; it is designed for data exchange.
| |
| ## Glyphs versus Characters |
| |
| Programmers often need to distinguish between characters and glyphs. A character |
| is the smallest semantic unit in a writing system. It is an abstract concept |
| such as the letter A or the exclamation point. A glyph is the visual |
| presentation of one or more characters, and is often dependent on adjacent |
| characters. |
| |
| There is not always a one-to-one mapping between characters and glyphs. In many |
| languages (Arabic is a prime example), the way a character looks depends heavily |
| on the surrounding characters. Standard printed Arabic has as many as four |
| different printed representations (glyphs) for every letter of the alphabet. In |
many languages, two or more letters may combine into a single glyph (called a
ligature), or a single character might be displayed with more than one glyph.
| |
| Despite the different visual variants of a particular letter, it still retains |
| its identity. For example, the Arabic letter heh has four different visual |
| representations in common use. Whichever one is used, it still keeps its |
| identity as the letter heh. It is this identity that Unicode encodes, not the |
| visual representation. This also cuts down on the number of independent |
| character values required. |
| |
| ## Overview of Unicode |
| |
Unicode was developed as a single coded character set that supports all
languages in the world. The first version of Unicode used 16-bit numbers,
| which allowed for encoding 65,536 characters without complicated multibyte |
| schemes. With the inclusion of more characters, and following implementation |
| needs of many different platforms, Unicode was extended to allow more than one |
| million characters. Several other encoding schemes were added. This introduced |
| more complexity into the Unicode standard, but far less than managing a large |
| number of different encodings. |
| |
Starting with Unicode 2.0 (published in 1996), the Unicode standard began
assigning numbers from 0 to 10FFFF<sub>16</sub>, which requires 21 bits but does not use
them completely. This gives more than enough room for all written languages in
the world. The original repertoire covered all major languages commonly used in
computing; Unicode continues to grow as more scripts are added.
| |
| The design of Unicode differs in several ways from traditional character sets |
| and encoding schemes: |
| |
| 1. Its repertoire enables users to include text efficiently in almost all |
| languages within a single document. |
| |
| 2. It can be encoded in a byte-based way with one or more bytes per character, |
| but the default encoding scheme uses 16-bit units that allow much simpler |
| processing for all common characters. |
| |
| 3. Many characters, such as letters with accents and umlauts, can be combined |
| from the base character and accent or umlaut modifiers. This combining |
| reduces the number of different characters that need to be encoded |
| separately. "Precomposed" variants for characters that existed in common |
| character sets at the time were included for compatibility. |
| |
| 4. Characters and their usage are well-defined and described. While traditional |
| character sets typically only provide the name or a picture of a character |
| and its number and byte encoding, Unicode has a comprehensive database of |
| properties available for download. It also defines a number of processes and |
| algorithms for dealing with many aspects of text processing to make it more |
| interoperable. |
| |
The early inclusion of all characters of commonly used character sets makes
Unicode a useful "pivot" point for converting between traditional character
sets, and makes it feasible to process non-Unicode text by first converting it
into Unicode, processing the text, and then converting it back to the original
encoding without loss of data.
| |
> :point_right: *The first 128 Unicode code point values are assigned to the same characters as
in US-ASCII; that is, the same number is assigned to the same character. The
same is true for the first 256 code point values of Unicode compared to ISO
8859-1 (Latin-1), which is itself a direct superset of US-ASCII. This makes it
easy to adapt many applications to Unicode because the numbers for many
syntactically important characters are the same.*
| |
| ## Character Encoding Forms and Schemes for Unicode |
| |
| Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room |
| to allow for unambiguous encoding of every character in common use. Such a |
| character number is called a "code point". |
| |
| > :point_right: *Unicode code points are just non-negative integer numbers in a certain range. |
| They do not have an implicit binary representation or a width of 21 or 32 bits. |
| Binary representation and unit widths are defined for encoding forms.* |
| |
| For internal processing, the standard defines three encoding forms, and for file |
| storage and protocols, some of these encoding forms have encoding schemes that |
| differ in their byte ordering. The difference between an encoding form and an |
| encoding scheme is that an encoding form maps the character set codes to values |
that fit into internal data types (like a `short` in C), while an encoding scheme
| maps to bits and bytes. For traditional encodings, they are the same since the |
| encoding forms already map to bytes. |
| |
| The different Unicode encoding forms are optimized for a variety of different |
| uses: |
| |
| 1. UTF-16, the default encoding form, maps a character code point to either one |
| or two 16-bit integers. |
| |
| 2. UTF-8 is a byte-based encoding that offers backwards compatibility with |
| ASCII-based, byte-oriented APIs and protocols. A character is stored with 1, |
| 2, 3, or 4 bytes. |
| |
| 3. UTF-32 is the simplest, but most memory-intensive encoding form: It uses one |
| 32-bit integer per Unicode character. |
| |
| 4. SCSU is an encoding scheme that provides a simple compression of Unicode |
| text. It is designed only for input and output, not for internal use. |
| |
| ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters |
| (with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial |
| support for supplementary characters. |
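
To make the differences between the encoding forms concrete, here is a minimal
C sketch that encodes one supplementary character, U+10400 (chosen only as an
example), in each of the three forms, using ICU's macros from `unicode/utf8.h`
and `unicode/utf16.h`:

```c
#include <stdio.h>
#include "unicode/utf8.h"
#include "unicode/utf16.h"

int main(void) {
    UChar32 c = 0x10400; /* a supplementary code point */

    /* UTF-32: always exactly one 32-bit unit, the code point value itself. */
    printf("UTF-32: %08X\n", (unsigned)c);

    /* UTF-16: one or two 16-bit units; supplementary characters need two. */
    UChar units16[U16_MAX_LENGTH];
    int32_t len16 = 0;
    U16_APPEND_UNSAFE(units16, len16, c);
    printf("UTF-16:");
    for (int32_t i = 0; i < len16; ++i) { printf(" %04X", units16[i]); }
    printf("\n"); /* prints: D801 DC00 */

    /* UTF-8: one to four bytes; supplementary characters need four. */
    uint8_t bytes8[U8_MAX_LENGTH];
    int32_t len8 = 0;
    U8_APPEND_UNSAFE(bytes8, len8, c);
    printf("UTF-8: ");
    for (int32_t i = 0; i < len8; ++i) { printf(" %02X", bytes8[i]); }
    printf("\n"); /* prints: F0 90 90 80 */
    return 0;
}
```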
| |
| For input/output, character encoding schemes define a byte serialization of |
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
| byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one |
| that serializes the code units in big-endian byte order (most significant byte |
| first), and one that serializes the code units in little-endian byte order |
| (least significant byte first). The corresponding encoding schemes are called |
| UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE. |
| |
| > :point_right: *The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they refer |
| either to character encoding forms where 16/32-bit words are processed and are |
| naturally stored in the platform endianness, or they refer to the |
| IANA-registered charset names, i.e., to character encoding schemes or byte |
| serializations. In addition to simple byte serialization, the charsets with |
| these names also use optional Byte Order Marks (see [Serialized Formats](#serialized-formats) below).* |
| |
| ## Overview of UTF-16 |
| |
| The default encoding form of the Unicode Standard uses 16-bit code units. Code |
| point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and |
| are encoded with just one 16-bit unit of the same value. Code points from |
| 10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called |
| "surrogates", and they are called a "surrogate pair" when, together, they |
| correctly encode one Unicode character. The first surrogate in a pair must be in |
| the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to |
| DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with |
| either one code unit that is not a surrogate or with a correct pair of |
| surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this |
| mechanism and will never, by themselves, be assigned any characters. |
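
The surrogate values can be computed with simple shifts and masks. Here is a
minimal C sketch using ICU's `U16_LEAD`, `U16_TRAIL`, and
`U16_GET_SUPPLEMENTARY` macros from `unicode/utf16.h`; the code point is an
arbitrary example:

```c
#include <stdio.h>
#include "unicode/utf16.h"

int main(void) {
    UChar32 c = 0x1D11E; /* an arbitrary supplementary code point */

    /* lead  = 0xD800 + ((c - 0x10000) >> 10)
       trail = 0xDC00 + ((c - 0x10000) & 0x3FF) */
    UChar lead = U16_LEAD(c);   /* 0xD834 */
    UChar trail = U16_TRAIL(c); /* 0xDD1E */
    printf("U+%04X encodes as %04X %04X\n", (unsigned)c, lead, trail);

    /* Recombining the pair recovers the original code point. */
    UChar32 back = U16_GET_SUPPLEMENTARY(lead, trail);
    printf("decoded back to U+%04X\n", (unsigned)back);
    return 0;
}
```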
| |
| Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1 |
| assigns more than 40,000 supplementary characters that make use of surrogate |
| pairs in UTF-16. |
| |
Note that comparing UTF-16 strings lexically based on their 16-bit code units
does not result in the same order as comparing the code points. This is not
usually an issue since only rarely used characters are affected, and most
processes do not depend on the two orders being the same. Where necessary, a
simple modification to a string comparison can be performed that still allows
efficient code unit-based comparisons and makes them compatible with code point
comparisons. ICU has C and C++ API functions for this, as shown below.
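
One such function is `u_strCompare()` from `unicode/ustring.h`, which takes a
`codePointOrder` flag. A minimal sketch, with strings contrived purely to
expose the ordering difference:

```c
#include <stdio.h>
#include "unicode/ustring.h"

int main(void) {
    /* s1 is U+FF21; s2 is U+10000, stored as the surrogate pair D800 DC00. */
    static const UChar s1[] = { 0xFF21, 0 };
    static const UChar s2[] = { 0xD800, 0xDC00, 0 };

    /* Code unit order: the first units are 0xFF21 vs. 0xD800, so s1 > s2. */
    int32_t units = u_strCompare(s1, -1, s2, -1, /* codePointOrder= */ 0);

    /* Code point order: U+FF21 < U+10000, so s1 < s2. */
    int32_t points = u_strCompare(s1, -1, s2, -1, /* codePointOrder= */ 1);

    printf("code unit order: %d, code point order: %d\n",
           (int)units, (int)points); /* positive, then negative */
    return 0;
}
```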
| |
| ## Overview of UTF-8 |
| |
| To meet the requirements of byte-oriented, ASCII-based systems, the Unicode |
| Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that |
| preserves ASCII transparency. |
| |
| UTF-8 maintains transparency for all the ASCII code values (0..127). These |
| values do not appear in any byte of a transformed result except as the direct |
| representation of the ASCII values. Thus, ASCII text is also UTF-8 text. |
| |
| Characteristics of UTF-8 include: |
| |
| 1. Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the |
| same value. Therefore, ASCII characters take up 50% less space with UTF-8 |
| encoding than with UTF-16. |
| |
| 2. All other code points are encoded with multibyte sequences, with the first |
| byte (lead byte) indicating the number of bytes that follow (trail bytes). |
   This results in very efficient parsing. The lead bytes are in the range C0<sub>16</sub>
   to FD<sub>16</sub>, the trail bytes are in the range 80<sub>16</sub> to BF<sub>16</sub>. The byte values FE<sub>16</sub>
   and FF<sub>16</sub> are never used.
| |
| 3. UTF-8 is relatively compact and resource conservative in its use of the |
| bytes required for encoding text in European scripts, but uses 50% more |
| space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two |
| bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16), |
| and all others four. |
| |
| 4. Binary comparisons of UTF-8 strings based on their bytes result in the same |
| order as comparing code point values. |
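
These storage costs can be checked with ICU's `U8_LENGTH` and `U16_LENGTH`
macros. A minimal C sketch with one sample code point from each size class
mentioned above (ASCII, Greek, CJK, and a supplementary character):

```c
#include <stdio.h>
#include "unicode/utf8.h"
#include "unicode/utf16.h"

int main(void) {
    /* One code point from each size class discussed above. */
    static const UChar32 samples[] = { 0x41, 0x3B1, 0x4E2D, 0x10400 };
    for (int i = 0; i < 4; ++i) {
        UChar32 c = samples[i];
        printf("U+%04X: %d UTF-8 byte(s), %d UTF-16 unit(s), 1 UTF-32 unit\n",
               (unsigned)c, U8_LENGTH(c), U16_LENGTH(c));
    }
    return 0;
}
```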
| |
| ## Overview of UTF-32 |
| |
| The UTF-32 encoding form always uses one single 32-bit integer per Unicode code |
| point. This results in a very simple encoding. |
| |
| The drawback is its memory consumption: Since code point values use only 21 |
| bits, one-third of the memory is always unused, and since most commonly used |
| characters have code point values of up to FFFF<sub>16</sub>, they take up only one 16-bit |
| unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less). |
| |
| UTF-32 is mainly used in APIs that are defined with the same data type for both |
| code points and code units. Modern versions of the C standard library that |
| support Unicode use a 32-bit `wchar_t` with UTF-32 semantics. |
| |
| ## Overview of SCSU |
| |
| SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of |
| Unicode text for both input and output. It is a simple compression that |
| transforms the text into a byte stream. It typically uses one byte per character |
| in small scripts, and two bytes per character in large, East Asian scripts. |
| |
| It is usually shorter than any of the UTFs. However, SCSU is stateful, which |
| makes it unsuitable for internal processing. It also uses all possible byte |
| values, which might require additional processing for protocols such as SMTP |
| (email). |
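
In ICU, SCSU is available like any other converter. A minimal C sketch that
compresses a short Cyrillic string, assuming the converter name "SCSU"; error
handling is abbreviated:

```c
#include <stdio.h>
#include "unicode/ucnv.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("SCSU", &status);

    /* Four Cyrillic characters: eight bytes in UTF-16. */
    static const UChar text[] = { 0x042F, 0x0437, 0x044B, 0x043A, 0 };
    char compressed[64];
    int32_t length = ucnv_fromUChars(cnv, compressed,
                                     (int32_t)sizeof(compressed), text, -1,
                                     &status);

    if (U_SUCCESS(status)) {
        /* SCSU switches to a Cyrillic window once, then uses one byte
           per character, so the output is much shorter than UTF-16. */
        printf("SCSU output: %d bytes\n", (int)length);
    }
    if (cnv != NULL) { ucnv_close(cnv); }
    return 0;
}
```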
| |
See also <https://www.unicode.org/reports/tr6/>.
| |
| ## Other Unicode Encodings |
| |
Other Unicode encodings have been developed over time for various purposes. Most
of them are implemented in ICU; see
[source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt).
| |
| 1. BOCU-1: Binary-Ordered Compression of Unicode |
| An encoding of Unicode that is about as compact as SCSU but has a much |
| smaller amount of state. Unlike SCSU, it preserves code point order and can |
   be used in 8-bit emails without a transfer encoding. BOCU-1 does **not**
   preserve ASCII characters in ASCII-readable form. See [Unicode Technical
   Note #6](http://www.unicode.org/notes/tn6/).
| |
2. UTF-7: Designed for 7-bit emails; simple and not very compact. Since email
   systems have been 8-bit safe for several years, UTF-7 is no longer necessary
   and is not recommended. Most ASCII characters are readable, others are
   base64-encoded. See [RFC 2152](http://www.ietf.org/rfc/rfc2152.txt).
| |
| 3. IMAP-mailbox-name: A variant of UTF-7 that is suitable for expressing |
| Unicode strings as ASCII characters for Unix filenames. |
| **The name "IMAP-mailbox-name" is specific to ICU!** |
   See [RFC 2060 INTERNET MESSAGE ACCESS PROTOCOL - VERSION
   4rev1](http://www.ietf.org/rfc/rfc2060.txt), section 5.1.3, Mailbox
   International Naming Convention.
| |
| 4. UTF-EBCDIC: An EBCDIC-friendly encoding that is similar to UTF-8. See |
   [Unicode Technical Report #16](http://www.unicode.org/reports/tr16/). **As
   of ICU 2.6, UTF-EBCDIC is not implemented in ICU.**
| |
| 5. CESU-8: Compatibility Encoding Scheme for UTF-16: 8-Bit |
| An incompatible variant of UTF-8 that preserves 16-bit-Unicode (UTF-16) |
| string order instead of code point order. Not for open interchange. See |
   [Unicode Technical Report #26](http://www.unicode.org/reports/tr26/).
| |
| ## Programming using UTFs |
| |
| Programming using any of the UTFs is much more straightforward than with |
| traditional multi-byte character encodings, even though UTF-8 and UTF-16 are |
| also variable-width encodings. |
| |
Within each Unicode encoding form, the code unit values for singletons (code
units that alone encode characters), lead units, and trail units are all
disjoint. This has crucial implications for implementations:
| |
| 1. Determines the number of units for one code point using the lead unit. This |
| is especially important for UTF-8, where there can be up to 4 bytes per |
| character. |
| |
2. Determines boundaries. If ICU users randomly access text, they can always
   determine the nearest code-point boundaries with a small number of machine
   instructions.
| |
3. Does not have any overlap. If ICU users search for string A in string B, they
   never get a false match on code points. Users do not need to convert to code
   points for string searching. False matches never occur since the end of one
   sequence is never the same as the start of another sequence. Overlap is one
   of the biggest problems with common multi-byte encodings like Shift-JIS. All
   the UTFs avoid this problem.
| |
4. Uses simple iteration. Getting the next or previous code point is
   straightforward, and only takes a small number of machine instructions (see
   the sketch following this list).
| |
| 5. Can use UTF-16 encoding, which is actually fully symmetric. ICU users can |
| determine from any single code unit whether it is the first, last, or only |
| one for a code point. Moving (iterating) in either direction through UTF-16 |
| text is equally fast and efficient. |
| |
| 6. Uses slow indexing by code points. This indexing procedure is a disadvantage |
| of all variable-width encodings. Except in UTF-32, it is inefficient to find |
| code unit boundaries corresponding to the nth code point or to find the code |
| point offset containing the nth code unit. Both involve scanning from the |
| start of the text or from a last known boundary. ICU, like most common APIs, |
| always indexes by code units. It counts code units and not code points. |
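
As an illustration of points 2 and 4, here is a minimal C sketch of code point
iteration over a UTF-16 string, using the `U16_NEXT` and `U16_PREV` macros from
`unicode/utf16.h`; the sample string is arbitrary:

```c
#include <stdio.h>
#include "unicode/utf16.h"

int main(void) {
    /* "A", then U+10400 as a surrogate pair, then "B". */
    static const UChar s[] = { 0x0041, 0xD801, 0xDC00, 0x0042 };
    const int32_t length = 4;
    UChar32 c;

    /* Forward: U16_NEXT reads one code point, advancing i by 1 or 2 units. */
    int32_t i = 0;
    while (i < length) {
        U16_NEXT(s, i, length, c);
        printf("forward:  U+%04X\n", (unsigned)c);
    }

    /* Backward iteration is just as simple because lead and trail
       surrogate values are disjoint. */
    i = length;
    while (i > 0) {
        U16_PREV(s, 0, i, c);
        printf("backward: U+%04X\n", (unsigned)c);
    }
    return 0;
}
```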
| |
| Conversion between different UTFs is very fast. Unlike converting to and from |
| legacy encodings like Latin-2, conversion between UTFs does not require table |
| look-ups. |
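
For example, `u_strToUTF8()` and its counterpart `u_strFromUTF8()` in
`unicode/ustring.h` convert directly between UTF-16 and UTF-8. A minimal
sketch:

```c
#include <stdio.h>
#include "unicode/ustring.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* "Hi " followed by U+10400 as a surrogate pair. */
    static const UChar utf16[] = { 0x0048, 0x0069, 0x0020, 0xD801, 0xDC00, 0 };

    char utf8[32];
    int32_t utf8Length = 0;
    u_strToUTF8(utf8, (int32_t)sizeof(utf8), &utf8Length, utf16, -1, &status);

    if (U_SUCCESS(status)) {
        /* Three single-byte characters plus one four-byte character. */
        printf("UTF-8 length: %d bytes\n", (int)utf8Length); /* prints 7 */
    }
    return 0;
}
```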
| |
| ICU provides two basic data type definitions for Unicode. `UChar32` is a 32-bit |
type for code points, and is used for single Unicode characters. It may be signed
| or unsigned. It is the same as `wchar_t` if it is 32 bits wide. `UChar` is an |
| unsigned 16-bit integer for UTF-16 code units. It is the base type for strings |
| (`UChar *`), and it is the same as `wchar_t` if it is 16 bits wide. |
| |
Some higher-level APIs, used especially for formatting, use characters closer to
a representation of a glyph. Such "user characters" are also called "graphemes"
or "grapheme clusters" and require strings so that combining sequences can be
included; a sketch of grapheme cluster iteration follows below.
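
Grapheme cluster boundaries are available through ICU's break iterator C API in
`unicode/ubrk.h`. A minimal sketch, using "e" plus a combining acute accent
(two code points that form one user character) as sample text:

```c
#include <stdio.h>
#include "unicode/ubrk.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* "e" + U+0301 COMBINING ACUTE ACCENT + "x": three code points,
       but only two user characters. */
    static const UChar text[] = { 0x0065, 0x0301, 0x0078 };

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text, 3, &status);
    if (U_SUCCESS(status)) {
        int32_t start = ubrk_first(bi);
        int32_t end;
        while ((end = ubrk_next(bi)) != UBRK_DONE) {
            printf("grapheme cluster at code units [%d..%d)\n",
                   (int)start, (int)end);
            start = end;
        }
        ubrk_close(bi);
    }
    return 0;
}
```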
| |
| ## Serialized Formats |
| |
| In files, input, output, and network protocols, text must be accompanied by the |
| specification of its character encoding scheme for a client to be able to |
| interpret it correctly. (This is called a "charset" in Internet protocols.) |
| However, an encoding scheme specification is not necessary if the text is only |
| used within a single platform, protocol, or application where it is otherwise |
| clear what the encoding is. (The language and text directionality should usually |
| be specified to enable spell checking, text-to-speech transformation, etc.) |
| |
| *The discussion of encoding specifications in this section applies to standard |
| Internet protocols where charset name strings are used. Other protocols may use |
| numeric encoding identifiers and assign different semantics to those identifiers |
| than Internet protocols.* |
| |
| Typically, the encoding specification is done in a protocol- and document |
| format-dependent way. However, the Unicode standard offers a mechanism for |
| tagging text files with a "signature" for cases where protocols do not identify |
| character encoding schemes. |
| |
| The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by |
| prepending it to a file or stream. The alternative function of U+FEFF as a |
| format control character has been copied to U+2060 WORD JOINER, and U+FEFF |
| should only be used for Unicode signatures. |
| |
| The different character encoding schemes generate different, distinct byte |
| sequences for U+FEFF: |
| |
| 1. UTF-8: EF BB BF |
| |
| 2. UTF-16BE: FE FF |
| |
| 3. UTF-16LE: FF FE |
| |
| 4. UTF-32BE: 00 00 FE FF |
| |
| 5. UTF-32LE: FF FE 00 00 |
| |
| 6. SCSU: 0E FE FF |
| |
| 7. BOCU-1: FB EE 28 |
| |
| 8. UTF-7: 2B 2F 76 ( 38 | 39 | 2B | 2F ) |
| |
| 9. UTF-EBCDIC: DD 73 66 73 |
| |
| ICU provides the function `ucnv_detectUnicodeSignature()` for Unicode signature |
| detection. |
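
A minimal C sketch of its use; the byte array is a contrived example beginning
with the UTF-8 signature, and error handling is abbreviated:

```c
#include <stdio.h>
#include "unicode/ucnv.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* A UTF-8 signature (EF BB BF) followed by the text "Hi". */
    static const char bytes[] = "\xEF\xBB\xBF" "Hi";

    int32_t signatureLength = 0;
    const char *charset = ucnv_detectUnicodeSignature(
        bytes, (int32_t)(sizeof(bytes) - 1), &signatureLength, &status);

    if (U_SUCCESS(status) && charset != NULL) {
        /* charset is "UTF-8" here; as described below, the signature
           should be removed after conversion, not before. */
        printf("detected %s, signature length %d\n",
               charset, (int)signatureLength);
    }
    return 0;
}
```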
| |
| *There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and |
| CESU-8 encode U+FEFF and in fact all BMP code points with the same bytes. The |
| opportunity for misidentification of one as the other is one of the reasons why |
| CESU-8 should only be used in limited, closed, specific environments.* |
| |
| In UTF-16 and UTF-32, where the signature also distinguishes between big-endian |
| and little-endian byte orders, it is also called a byte order mark (BOM). The |
| signature works for UTF-16 since the code point that has the byte-swapped |
| encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a |
| "non-character" code point.) In Internet protocols, if an encoding specification |
| of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte |
| sequence (BOM) that identifies the byte ordering, which is not the case for the |
| encoding scheme/charset names with "BE" or "LE". |
| |
| *If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not |
| begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE, |
| respectively.* |
| |
| A signature is not part of the content, and must be stripped when processing. |
| For example, blindly concatenating two files will give an incorrect result. |
| |
| If a signature was detected, then the signature "character" U+FEFF should be |
| removed from the Unicode stream **after** conversion. Removing the signature |
| bytes before conversion could cause the conversion to fail for stateful |
| encodings like BOCU-1 and UTF-7. |
| |
| Whether a signature is to be recognized or not depends on the protocol or |
| application. |
| |
| 1. If a protocol specifies a charset name, then the byte stream must be |
| interpreted according to how that name is defined. Only the "UTF-16" and |
| "UTF-32" names include recognition of the byte order marks that are specific |
| to them (and the ICU converters for these names do this automatically). None |
| of the other Unicode charsets are defined to include any signature/BOM |
| handling. |
| |
| 2. If no charset name is provided, for example for text files in most |
| filesystems, then applications must usually rely on heuristics to determine |
| the file encoding. Many document formats contain an embedded or implicit |
| encoding declaration, but for plain text files it is reasonable to use |
| Unicode signatures as simple and reliable heuristics. This is especially |
| common on Windows systems. However, some tools for plain text file handling |
| (e.g., many Unix command line tools) are not prepared for Unicode |
| signatures. |
| |
| ## The Unicode Standard Is An Industry Standard |
| |
| The Unicode standard is an industry standard and parallels ISO 10646-1. Around |
| 1993, these two standards were effectively merged into the same character set |
| standard. Both standards have the same character repertoire and the same |
| encoding forms and schemes. |
| |
| One difference used to be that the ISO standard defined code point values to be |
| from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add |
| an amendment to the standard. The amendment removes this difference by declaring |
| that no characters will ever be assigned code points above 10FFFF<sub>16</sub>. The main |
| reason for the ISO work group's decision is interoperability between the UTFs. |
UTF-16 cannot encode any code points above this limit.
| |
| This means that the code point space for both Unicode and ISO 10646 is now the |
| same! **These changes to ISO 10646 have been made recently and should be |
| complete in the edition ISO 10646:2003 which also combines all parts of the |
| standard into one.** |
| |
| The former, larger code space is the reason why the ISO definition of UTF-8 |
| specifies sequences of five and six bytes to cover that whole range. |
| |
| Another difference is that the ISO standard defines encoding forms "UCS-4" and |
| "UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of |
| 7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee |
has accepted that characters above 10FFFF<sub>16</sub> will not be encoded, so there is
| essentially no difference between the forms. The "4" stands for "four-byte |
| form". |
| |
UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF<sub>16</sub>,
excluding the surrogate code points. Thus, it cannot represent the characters
with code points above FFFF<sub>16</sub> (called supplementary characters).
| |
| *There is no conversion necessary between UCS-2 and UTF-16. The difference is |
| only in the interpretation of surrogates.* |
| |
| The standards differ in what kind of information they provide: The Unicode |
| standard provides more character properties and describes algorithms etc., while |
| the ISO standard defines collections, subsets and similar. |
| |
| The standards are synchronized, and the respective committees work together to |
| add new characters and assign code point values. |