 ---
 layout: default
 title: C/POSIX Migration
 nav_order: 6
 parent: ICU
 ---
 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # C/POSIX Migration
 {: .no_toc }

 ## Contents
 {: .no_toc .text-delta }

 1. TOC
 {:toc}

 ---

 ## Migration from Standard C and POSIX APIs

 The ISO C and POSIX standards define a number of APIs for string handling and
 internationalization in C. They do not support Unicode well because they were
 initially designed before Unicode/ISO 10646 were developed, and the POSIX APIs
 are also problematic for other internationalization aspects.

 This chapter discusses C/POSIX APIs with their problems, and shows which ICU
 APIs to use instead.

 > :point_right:  **Note**: *We use the term "POSIX" to mean the POSIX.1 standard (IEEE Std 1003.1) which
 defines system interfaces and headers with relevance for string handling and
 internationalization. The XPG3, XPG4, Single Unix Specification (SUS) and other
 standards include POSIX.1 as a subset, adding other specifications that are
 irrelevant for this topic.*

 > :construction: This chapter is not complete yet – more POSIX APIs are expected to be discussed
 in the future.

 ## Strings and Characters

 ### Character Sets and Encodings

 #### ISO C

 The ISO C standard provides two basic character types (`char` and `wchar_t`) and
 defines strings as arrays of units of these types. The standard allows nearly
 arbitrary character sets and string encodings, which was necessary when there
 was no single character set that worked everywhere.

 For portable C programs, characters and strings are opaque, i.e., a program
 cannot assume that any particular character is represented by any particular
 code or sequence of codes. Programs use standard library functions to handle
 characters and strings. Only a small set of characters — usually the set of
 graphic characters available in US-ASCII — can be reliably accessed via
 character and string literals.

 #### Problems

 1.  Many different encodings are used on each platform, making it difficult for
     multiple programs and libraries to process the same text.

 2.  Programs often need to know the codes of special characters. For example,
     code that parses a filename needs to know how the path and file separators
     are encoded; this is commonly possible because filenames deliberately use
     US-ASCII characters, but any software that uses non-ASCII characters becomes
     platform-dependent. It is practically impossible to provide sophisticated
     text processing without knowledge of the character set, its string encoding,
     and other detailed features.

 3.  The C/POSIX standards only provide a very limited set of useful functions
     for character and string handling; many functions that are provided do not
     work for non-trivial cases.

 4.  While the size of the `char` type is in practice fixed at 8 bits in modern
     compilers, and its common encodings are reasonably well documented, the size
     of `wchar_t` is 8, 16, or 32 bits depending on the compiler, and only a
     few of the string encodings used with it are documented.

 5.  See also [What size wchar_t do I need for
     Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html)

 6.  A program based on this model must be recompiled for each platform. Usually,
     it must be recompiled for each supported language or family of languages.

 7.  The ISO C standard basically requires, by how its standard functions are
     defined, that the data type for a single character code in a large character
     set is the same as the string base unit type (wchar_t). This has led to C
     standard library implementations using Unicode encodings which are either
     limited for single-character functions to only part of Unicode, or suffer
     from reduced interoperability with most Unicode-aware software.

 #### ICU

 ICU always processes Unicode text. Unicode covers all languages and allows safe
 hard coding of character codes, in addition to providing many standard or
 recommended algorithms and a lot of useful character property data. See the
 chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and others.

 ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it
 fully interoperable with most Unicode-aware software. See [UTF-16 for
 Processing](http://www.unicode.org/notes/tn12/). For ICU4J, this follows
 naturally because the Java language and the JDK use UTF-16.
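
To make the contrast with `wchar_t` concrete, here is a minimal sketch in plain C (no ICU required) of how a code point maps onto fixed 16-bit UTF-16 code units. The helper names `wchar_holds_all_code_points` and `encode_utf16` are hypothetical and exist only for this illustration; real ICU4C code would normally use the macros in `unicode/utf16.h` (for example `U16_APPEND`) rather than hand-written arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Nonzero if a single wchar_t can hold every Unicode code point
 * (up to U+10FFFF) on this platform; the answer differs by compiler. */
int wchar_holds_all_code_points(void) {
    return WCHAR_MAX >= 0x10FFFF;
}

/* Encodes one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int encode_utf16(uint32_t cp, uint16_t out[2]) {
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one unit, same value as the code point. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

The string unit is always 16 bits wide regardless of the platform's `wchar_t`, while a single character is handled as a separate, wider code point value. This separation is what the `wchar_t` model described under Problems lacks.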

 ICU uses and/or provides direct access to all of the [Unicode
 properties](strings/properties.md) which provide a much finer-grained
 classification of characters than [C/POSIX character
 classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).

 In C/C++ source code character and string literals, ICU uses only "invariant"
 characters. They are the subset of graphic ASCII characters that are almost
 always encoded with the same byte values on all systems. (One set of byte values
 for ASCII-based systems, and another such set of byte values for EBCDIC
 systems.) See
 [`utypes.h`](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
 for the set of "invariant" characters.

 With the use of Unicode, the implementation of many of the Unicode standard
 algorithms, and its cross-platform availability, ICU provides for consistent,
 portable, and reliable text processing.

 ### Case Mappings

 #### ISO C

 The standard C functions `tolower()`, `toupper()`, etc. take and return one
 character code each.

 #### Problems

 1.  This does not work for German, where the character "ß" (sharp s) uppercases
     to the two characters "SS". (It "expands".)

 2.  It does not work for Greek, where the character "Σ" (capital sigma)
     lowercases to either "ς" (small final sigma) or "σ" (small sigma) depending
     on whether the capital sigma is the last letter in a word. (It is
     context-dependent.)

 3.  It does not work for Lithuanian and Turkic languages where a "combining dot
     above" character may need to be removed in certain cases. (It "contracts"
     and is language- and context-dependent.)

 4.  There are a number of other such cases.

 5.  There are no standard functions for title-casing strings.

 6.  There are no standard functions for case-folding strings. (Case-folding is
     used for case-insensitive comparisons; there are C/POSIX functions for
     direct, case-insensitive comparisons of pairs of strings. Case-folding is
     useful when one string is compared to many others, or as part of a chain of
     transformations of a string.)

 #### ICU

 Case mappings are operations taking and returning strings, to support length
 changes and context dependencies. Unicode provides algorithms and data for
 proper case mappings, and ICU provides APIs for them. (See the API references
 for various string functions and for Transforms/Transliteration.)

 ### Character Classes

 #### ISO C

 The standard C functions `isalpha()`, `isdigit()`, etc. each take a character
 code and return a boolean value indicating whether the character belongs to
 the respective character class of the current locale.

 #### Problems

 1.  Character classes are bound to locales, instead of providing consistent
     classifications for characters.

 2.  The same character may have different classifications depending on the
     locale and the platform.

 3.  There are only very few POSIX character classes, and they are not well
     defined. For example, there is a class for punctuation characters but not
     one for symbols.

 4.  For example, the dollar symbol (“$”) may or may not belong to the `punct`
     class depending on the locale, even on the same system.

 5.  The standard allows at most two sets of decimal digits: the digits of the
     “portable character set” (i.e., those in the ASCII repertoire) and one more.
     Some implementations only recognize ASCII digits in the `isdigit()` function.
     However, there are many sets of decimal digits in a multilingual character
     set like Unicode.

 6.  The POSIX standard assumes that each locale definition file carries the
     character class data for all relevant characters. With many locales using
     overlapping character repertoires, this can lead to a lot of duplication.
     For efficiency, many UTF-8 locales define character classes only for very
     few characters instead of for all of Unicode. For example, some de_DE.utf-8
     locales only define character classes for characters used in German, or for
     the repertoire of ISO 8859-1 – in other words, for only a tiny fraction of
     the representable Unicode repertoire. Processing of text using more than
     this repertoire is not possible with such an implementation.

 7.  For more about the problems with POSIX character classes in a Unicode
     context see [Annex C: Compatibility Properties in Unicode
     Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
     and see the mailing list archives for the unicode list (on unicode.org). See
     also the ICU design document about [C/POSIX character
     classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).

 #### ICU

 ICU provides locale-independent access to all [Unicode
 properties](strings/properties.md) (except Unihan.txt properties), as well as to
 the POSIX character classes, via functions defined in `uchar.h` and in ICU4J's
 `UCharacter` class (see API references) as well as via `UnicodeSet`. The POSIX
 character classes are implemented according to the recommendations in UTS #18.

 The Unicode Character Database (UCD) defines more than 70 character properties.
 Their values are designed for the large character set as well as for real text
 processing, and they are updated with each version of Unicode. The UCD is
 available online, facilitating industry-wide consistency in the implementation
 of Unicode properties.

 ## Formatting and Parsing

 ### Currency Formatting

 #### POSIX

 The `strfmon()` function is used to format monetary values. The default format and
 the currency display symbol or display name are selected by the `LC_MONETARY`
 locale ID. The number formatting can also be controlled with a formatting string
 resembling what `printf()` uses.
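
For orientation, here is a minimal sketch of the POSIX approach. The wrapper name `format_with_locale_currency` is hypothetical, and which locale IDs are accepted depends entirely on what is installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <monetary.h>
#include <stddef.h>

/* Formats amount with the national currency format ("%n") of localeId.
 * Returns the number of bytes written, or -1 if the locale could not be
 * selected or the buffer is too small.
 * Note: this changes the process-wide LC_MONETARY setting. */
long format_with_locale_currency(const char *localeId, double amount,
                                 char *out, size_t outSize) {
    assert(localeId != NULL && out != NULL);
    /* The currency is not a parameter: it is implied by the locale. */
    if (setlocale(LC_MONETARY, localeId) == NULL) {
        return -1;
    }
    return (long)strfmon(out, outSize, "%n", amount);
}
```

Note that the function has no parameter for the currency itself. The only way to obtain a different currency is to switch to a different locale, which also changes the number format.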

 #### Problems

 1.  Selection of the currency via a locale ID is unreliable: Countries change
     currencies over time, and the locale data for a particular country may not
     be available. This results in using the wrong currency. For example, an
     application may assume that a country has switched from a previous currency
     to the Euro, but it may run on an OS that predates the switch.

 2.  Using a single locale ID for the whole format makes it very difficult to
     format values for multiple currencies with the same number format (for
     example, for an exchange rate list or for showing the price of an item
     adjusted for several currencies). `strfmon()` allows the number format to
     be specified fully, but then the application cannot use a country's default
     number format.

 3.  The set of formattable currencies is limited to those that are available via
     locale IDs on a particular system.

 4.  There does not appear to be a function to parse currency values.

 #### ICU

 ICU number formatting APIs have separate, orthogonal settings for the number
 format, which can be selected with a locale ID, and the currency, which is
 specified with an ISO code. See the [Formatting
 Numbers](format_parse/numbers/index.md) chapter for details.