---
layout: default
title: Case Mappings
nav_order: 1
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Case Mappings
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Overview
Case mapping converts characters between their uppercase, lowercase, and title
case forms for a given language. Case is a normative property of characters
in specific alphabets (e.g. Latin, Greek, Cyrillic, Armenian, and Georgian)
whereby characters are considered to be variants of a single letter. ICU refers
to these variants, which may differ markedly in shape and size, as uppercase
letters (also known as capital or majuscule) and lowercase letters (also known
as small or minuscule). Alphabets with case differences are called bicameral and
alphabets without case differences are called unicameral.
Due to the inclusion of certain composite characters for compatibility, such as
the Latin capital letter 'DZ' (\\u01F1 'Ǳ'), there is a third case called title
case. Title case is used to capitalize the first character of a word such as the
Latin capital letter 'D' with small letter 'z' (\\u01F2 'ǲ'). The term "title
case" can also be used to refer to words whose first letter is an uppercase or
title case letter and the rest are lowercase letters. However, not all words in
the title of a document or first words in a sentence will be title case. The use
of title case words is language dependent. For example, in English, "Taming of
the Shrew" would be the appropriate capitalization and not "Taming Of The
Shrew".
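The three case forms of the 'DZ' digraph can be inspected with a short sketch. It uses Python's standard library rather than ICU, purely to show the Unicode data involved; the names `case_forms`, `forms`, `categories`, and `mechanical_title` are illustrative and not part of any ICU API.

```python
import unicodedata

DZ_LOWER = "\u01F3"  # LATIN SMALL LETTER DZ


def case_forms(ch):
    """Return the (upper, title, lower) forms of a single character."""
    return ch.upper(), ch.title(), ch.lower()


# U+01F3 maps to three distinct characters: U+01F1, U+01F2, and itself.
forms = case_forms(DZ_LOWER)

# General categories of the three forms: uppercase, titlecase, lowercase letter.
categories = [unicodedata.category(c) for c in forms]

# Mechanically title-casing every word ignores English conventions
# for short words such as "of" and "the".
mechanical_title = "taming of the shrew".title()
```

Here `forms` holds U+01F1, U+01F2, and U+01F3, and `mechanical_title` is "Taming Of The Shrew", which is the capitalization the paragraph above describes as inappropriate for English.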
> :point_right: **Note**: *As of Unicode 11, Georgian now has Mkhedruli (lowercase) and Mtavruli
(uppercase) which form case pairs, but are not used in title case.*
Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ustring/ustring.cpp).
Please refer to the following sections in [The Unicode Standard](http://www.unicode.org/versions/latest/)
for more information about case mapping:
* 3.13 Default Case Algorithms
* 4.2 Case
* 5.18 Case Mappings
## Simple (Single-Character) Case Mapping
Simple case mapping in ICU is language-independent and strictly one-to-one: each
character maps to exactly one character.
A character is considered to have a lowercase, uppercase, or title case
equivalent if there is a respective "simple" case mapping specified for the
character in the [Unicode Character Database](http://www.unicode.org/ucd/) (UnicodeData.txt).
If a character has no mapping equivalent, the result is the character itself.
The APIs for simple case mapping, declared in `uchar.h`, handle only single
characters of type `UChar32` and return only single characters. To convert a
string without language-specific behavior, use the APIs in either `unistr.h` or
`ustring.h` with the root locale (an empty string `""` for the `ustring.h`
functions). A `NULL` locale argument selects the default locale instead.
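The one-in, one-out contract of simple case mapping can be sketched as follows. This is an illustration in Python, not ICU code: Python exposes only full mappings, so the hypothetical helper `upper_code_point` refuses any result that expands to more than one code point instead of guessing. In ICU the corresponding single-character functions are `u_toupper`, `u_tolower`, and `u_totitle`.

```python
def upper_code_point(cp):
    """Uppercase one code point and return one code point.

    Assumption: str.upper() applies full mappings, so an expanding
    result is rejected rather than reported as a simple mapping.
    """
    mapped = chr(cp).upper()
    if len(mapped) != 1:
        raise ValueError(f"U+{cp:04X} has no 1:1 result via str.upper()")
    return ord(mapped)


latin = upper_code_point(0x0061)  # 'a' has an uppercase equivalent
greek = upper_code_point(0x03C3)  # Greek small sigma has one too
digit = upper_code_point(0x0031)  # '1' has no mapping, so it maps to itself
```

A character such as U+00DF (ß) makes the helper raise, because its uppercase form in Python is the two-letter full mapping. A simple mapping API would return the character unchanged.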
## Full (Language-Specific) Case Mapping
Case mappings differ between locales. For instance, in Turkish the uppercase
equivalent of Latin small letter 'i' is Latin capital letter 'I' with dot above
(\\u0130 'İ'), whereas in English it is the dotless capital letter 'I'.
Similar to the simple case mapping API, a character is considered to have a
lowercase, uppercase or title case equivalent if there is a respective mapping
specified for the character in the Unicode Character Database (UnicodeData.txt,
supplemented by SpecialCasing.txt for multi-character and language-specific
mappings). In the case where a character has no mapping equivalent, the result
is the character itself.
To convert a string using language-specific case mappings, use the APIs in
`ustring.h` and `unistr.h` and pass the intended locale as an argument.
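The effect of the locale argument can be sketched with the Turkish example. Python's `str.upper()` takes no locale, so the helper below, `upper_for_language`, is a hypothetical stand-in that hard-codes only the Turkic dotted-i rule; it is not how ICU implements locale-sensitive mappings and it ignores every other language-specific rule.

```python
TURKIC_LANGUAGES = ("tr", "az")


def upper_for_language(text, language=""):
    """Uppercase text, applying the Turkic dotted-i rule when requested.

    Assumption: only the mapping i -> U+0130 is modeled here. The dotless
    small letter U+0131 already uppercases to 'I' by default.
    """
    if language in TURKIC_LANGUAGES:
        text = text.replace("i", "\u0130")
    return text.upper()


default_result = upper_for_language("istanbul")
turkish_result = upper_for_language("istanbul", "tr")
```

With no language, the result starts with the plain capital 'I'. With `"tr"`, it starts with U+0130.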
ICU implements full Unicode string case mappings.
**In general:**
* **case mapping can change the number of code points and/or code units of a
string,**
* **is language-sensitive (results may differ depending on language), and**
* **is context-sensitive (a character in the input string may map differently
depending on surrounding characters).**
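Two of these properties, length changes and context sensitivity, can be observed with Python's built-in full case mappings. This is a small sketch of the Unicode behavior rather than ICU code, and the variable names are illustrative.

```python
# Length can change: U+00DF uppercases to the two letters "SS",
# and the ligature U+FB01 uppercases to "FI".
word = "stra\u00DFe"
expanded = word.upper()
ligature = "\uFB01".upper()

# Context matters: capital sigma (U+03A3) lowercases to final sigma
# (U+03C2) at the end of a word and to U+03C3 elsewhere.
contextual = "\u03A3\u0391\u03A3".lower()

lengths = (len(word), len(expanded))
```

Because of the length change, callers should not assume that the output fits in a buffer sized for the input.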
## Case Folding
Case folding maps strings to a canonical form where case differences are erased.
Using the case folding API, ICU supports fast matches without regard to case in
lookups, since only binary comparison is required.
The CaseFolding.txt file in the Unicode Character Database is used for
performing locale-independent case folding. This text file is generated from the
case mappings in the Unicode Character Database, using both the single-character
and the multi-character mappings. The CaseFolding.txt file maps all
characters having different case forms to a common form. To compare two
strings for case-insensitive matching, you can fold each string and then
use a binary comparison. There are also functions to compare two strings
case-insensitively using the same case folding data.
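The fold-then-compare approach can be sketched with Python's `str.casefold()`, which applies Unicode full case folding. This is an illustration of the technique, not the ICU API; `caseless_equal`, `index`, and `lookup` are hypothetical names.

```python
def caseless_equal(a, b):
    """Compare two strings by binary comparison of their folded forms."""
    return a.casefold() == b.casefold()


# Fold keys once at insertion time so each lookup is a plain binary match.
index = {name.casefold(): name for name in ("Stra\u00DFe", "Gr\u00F6\u00DFe")}


def lookup(name):
    """Return the stored spelling that matches name without regard to case."""
    return index.get(name.casefold())


folded_match = caseless_equal("Stra\u00DFe", "STRASSE")
lowercase_match = "Stra\u00DFe".lower() == "STRASSE".lower()
```

Folding matches the two spellings, while lowercasing alone does not, because lowercasing leaves U+00DF in place. The sketch does not normalize its input, so canonically equivalent strings in different normalization forms may still compare unequal.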
Unicode case folding is not context-sensitive. It is also not
language-sensitive, although there is a flag for whether to apply special
mappings for use with Turkic (Turkish/Azerbaijani) text data.
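The effect of the Turkic flag can be sketched as below. Python has no such option, so `fold_case` and its `turkic` parameter are hypothetical; the helper hard-codes the two Turkic entries from CaseFolding.txt (U+0049 folds to U+0131, and U+0130 folds to U+0069) before applying the default folding. In ICU the corresponding option is `U_FOLD_CASE_EXCLUDE_SPECIAL_I`.

```python
def fold_case(text, turkic=False):
    """Case-fold text, optionally using the Turkic mappings for I and U+0130.

    Assumption: only the two 'T' status entries of CaseFolding.txt are
    modeled; everything else uses Python's default full case folding.
    """
    if turkic:
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.casefold()


default_i = fold_case("I")
turkic_i = fold_case("I", turkic=True)
default_dotted = fold_case("\u0130")
turkic_dotted = fold_case("\u0130", turkic=True)
```

By default, capital 'I' folds to 'i' and U+0130 folds to 'i' followed by a combining dot above (U+0307). With the Turkic option, they fold to U+0131 and 'i' respectively.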
The case folding APIs are declared in:
1. `uchar.h` for single-character folding
2. `ustring.h` and `unistr.h` for string folding