 ---
 layout: default
 title: Case Mappings
 nav_order: 1
 parent: Transforms
 ---
 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # Case Mappings
 {: .no_toc }

 ## Contents
 {: .no_toc .text-delta }

 1. TOC
 {:toc}

 ---

 ## Overview

 Case mapping converts characters to their uppercase, lowercase, or title case
 equivalents for a given language. Case is a normative property of characters
 in specific alphabets (e.g. Latin, Greek, Cyrillic, Armenian, and Georgian)
 whereby characters are considered to be variants of a single letter. ICU refers
 to these variants, which may differ markedly in shape and size, as uppercase
 letters (also known as capital or majuscule) and lowercase letters (also known
 as small or minuscule). Alphabets with case differences are called bicameral and
 alphabets without case differences are called unicameral.

 Due to the inclusion of certain composite characters for compatibility, such as
 the Latin capital letter 'DZ' (\\u01F1 'Ǳ'), there is a third case called title
 case. Title case is used to capitalize the first character of a word such as the
 Latin capital letter 'D' with small letter 'z' (\\u01F2 'ǲ'). The term "title
 case" can also be used to refer to words whose first letter is an uppercase or
 title case letter and the rest are lowercase letters. However, not all words in
 the title of a document or first words in a sentence will be title case. The use
 of title case words is language dependent. For example, in English, "Taming of
 the Shrew" would be the appropriate capitalization and not "Taming Of The
 Shrew".
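
The three case forms of the 'DZ' digraph can be inspected with a short script. This is a minimal sketch that assumes Python's built-in `str` methods and `unicodedata` module rather than ICU's own API, purely to keep the example free of dependencies; the helper name `case_forms` is hypothetical.

```python
import unicodedata

# The three case forms of the DZ digraph discussed above.
DZ_UPPER = "\u01F1"  # Ǳ LATIN CAPITAL LETTER DZ
DZ_TITLE = "\u01F2"  # ǲ LATIN CAPITAL LETTER D WITH SMALL LETTER Z
DZ_LOWER = "\u01F3"  # ǳ LATIN SMALL LETTER DZ


def case_forms(text):
    """Return the (uppercase, title case, lowercase) forms of text.

    Uses Python's built-in mappings, not ICU; an assumption of this sketch.
    """
    return text.upper(), text.title(), text.lower()


if __name__ == "__main__":
    for ch in (DZ_UPPER, DZ_TITLE, DZ_LOWER):
        # The general category is Lu, Lt, or Ll respectively.
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), case_forms(ch))

    # Mechanical title-casing capitalizes every word.
    print("taming of the shrew".title())
```

Each of the three characters maps to the same triple (U+01F1, U+01F2, U+01F3). The last line prints "Taming Of The Shrew", which shows why mechanical title-casing is not enough for language-appropriate titles.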

 > :point_right: **Note**: *As of Unicode 11, Georgian now has Mkhedruli (lowercase) and Mtavruli
 (uppercase) which form case pairs, but are not used in title case.*

 Sample code is available in the ICU source code library at
 [icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ustring/ustring.cpp).

 Please refer to the following sections in [The Unicode Standard](http://www.unicode.org/versions/latest/)
 for more information about case mapping:

 *   3.13 Default Case Algorithms
 *   4.2 Case
 *   5.18 Case Mappings

 ## Simple (Single-Character) Case Mapping

 The general (simple) case mapping in ICU is language-independent and maps each
 character to exactly one character.

 A character is considered to have a lowercase, uppercase, or title case
 equivalent if there is a respective "simple" case mapping specified for the
 character in the [Unicode Character Database](http://www.unicode.org/ucd/) (UnicodeData.txt).
 If a character has no mapping equivalent, the result is the character itself.

 The APIs provided for the general case mapping, located in the `uchar.h` file, handle
 only single characters of type `UChar32` and return only single characters. To
 convert a string without language-specific behavior, use the APIs in either
 the `unistr.h` or `ustring.h` files with the root locale (the empty locale ID `""`).
 Note that in `ustring.h` a `NULL` locale argument selects the default locale, not
 the root locale.

 ## Full (Language-Specific) Case Mapping

 There are different case mappings for different locales. For instance, unlike
 in English, the uppercase equivalent of Latin small letter 'i' in Turkish is Latin
 capital letter 'I' with dot above (\\u0130 'İ').

 As with simple case mapping, a character is considered to have a
 lowercase, uppercase, or title case equivalent if there is a respective mapping
 specified for the character in the Unicode Character Database (UnicodeData.txt,
 supplemented by SpecialCasing.txt for multi-character, language-specific, and
 context-sensitive mappings). If a character has no mapping equivalent, the
 result is the character itself.

 To convert a string using language-specific case mappings, use the APIs in `ustring.h`
 and `unistr.h` and pass the intended locale as an argument.

 ICU implements full Unicode string case mappings.

 **In general:**

 *   **case mapping can change the number of code points and/or code units of a
     string,**
 *   **is language-sensitive (results may differ depending on language), and**
 *   **is context-sensitive (a character in the input string may map differently
     depending on surrounding characters).**
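
Two of these properties can be observed with a few well-known characters. This is a minimal sketch that assumes Python's built-in `str.upper()` and `str.lower()` instead of ICU's API; they apply the Unicode default full case mappings but take no locale argument, so language sensitivity (such as the Turkish dotted and dotless i) is not shown here. The helper `code_points` is hypothetical.

```python
def code_points(text):
    """List the code points of text as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in text]


if __name__ == "__main__":
    # 1. The number of code points can change.
    #    U+00DF (sharp s) uppercases to two letters.
    print(code_points("\u00DF"), "->", code_points("\u00DF".upper()))
    #    U+0130 (capital I with dot above) lowercases to "i" + U+0307.
    print(code_points("\u0130"), "->", code_points("\u0130".lower()))

    # 2. The mapping is context-sensitive: capital sigma (U+03A3) becomes
    #    final sigma (U+03C2) at the end of a word, and U+03C3 otherwise.
    print(code_points("\u039F\u0394\u039F\u03A3".lower()))  # word-final sigma
    print(code_points("\u03A3".lower()))                    # isolated sigma
```

In ICU the same effects come from the string APIs in `ustring.h` and `unistr.h`, which additionally accept a locale.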

 ## Case Folding

 Case folding maps strings to a canonical form where case differences are erased.
 Using the case folding API, ICU supports fast matches without regard to case in
 lookups, since only binary comparison is required.

 The CaseFolding.txt file in the Unicode Character Database is used for
 performing locale-independent case folding. This text file is generated from the
 case mappings in the Unicode Character Database, using both the single-character
 and the multi-character mappings. The CaseFolding.txt file maps all
 characters having different case forms to a common form. To compare two
 strings for case-insensitive matching, you can fold each string and then
 use a binary comparison. There are also functions to compare two strings
 case-insensitively using the same case folding data.
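
The fold-then-compare approach can be sketched as follows. The example assumes Python's built-in `str.casefold()`, which applies Unicode full case folding, as a stand-in for ICU's folding functions; it has no option for the Turkic mappings. The function name `caseless_equal` is hypothetical.

```python
def caseless_equal(a, b):
    """Compare two strings without regard to case.

    Folds both strings, then does a plain binary comparison.
    Assumption: Python's str.casefold() stands in for ICU's folding API.
    """
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Folding maps sharp s to "ss", so these two spellings match.
    print(caseless_equal("Stra\u00DFe", "STRASSE"))

    # Lowercasing alone keeps the sharp s and would not match.
    print("Stra\u00DFe".lower() == "STRASSE".lower())

    # Final sigma (U+03C2) and regular sigma (U+03C3) fold to the same form.
    print(caseless_equal("\u03BF\u03B4\u03BF\u03C2", "\u039F\u0394\u039F\u03A3"))
```

Folding once and storing the folded key lets repeated lookups use plain binary comparison, which is the fast-matching use case described above.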

 Unicode case folding is not context-sensitive. It is also not
 language-sensitive, although there is a flag for whether to apply special
 mappings for use with Turkic (Turkish/Azerbaijani) text data.

 The case folding APIs are declared in:

 1.  `uchar.h` for single-character folding

 2.  `ustring.h` and `unistr.h` for string folding