
 # ICU Services

 ## Overview of the ICU Services

 ICU enables you to write language-independent C and C++ code that works with
 separate, localized resources to produce language-specific results. ICU supports
 many features, including language-sensitive text, dates, times, numbers,
 currency, messages, sorting, and searching. ICU provides language-specific
 results for a broad range of languages.

 ### Strings, Properties and CharacterIterator

 ICU provides basic Unicode support for the following:

 *   [Unicode string](strings/index.md)
     ICU includes type definitions for UTF-16 strings and code points. It also
     contains many C u_string functions and the C++ UnicodeString class with many
     additional string functions.

 *   [Unicode properties](strings/properties.md)
     ICU includes the C definitions and functions found in uchar.h as well as
     some macros found in utf.h. It also includes the C++ Unicode class.

 *   [Unicode string iteration](strings/characteriterator.md)
     In C, ICU uses the macros in utf.h for the iteration of strings. In C++, ICU
     uses the CharacterIterator class and its subclasses.
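To make string iteration more concrete, the following is a minimal sketch, independent of ICU, of what walking a UTF-16 string by code point involves. The function name `decodeUtf16` is hypothetical and not part of ICU, and passing unpaired surrogates through unchanged is an illustrative choice. In real code the macros in utf.h or a CharacterIterator would do this work.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an ICU API): decode UTF-16 code units into code points.
// A lead surrogate followed by a trail surrogate forms one supplementary code point;
// unpaired surrogates are passed through unchanged.
std::vector<char32_t> decodeUtf16(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        const bool isLead = c >= 0xD800 && c <= 0xDBFF;
        if (isLead && i + 1 < s.size()) {
            const char32_t trail = s[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
                ++i;  // the trail surrogate has been consumed as well
            }
        }
        assert(c <= 0x10FFFF);  // every result lies within the Unicode code space
        out.push_back(c);
    }
    return out;
}
```

A string such as `u"a\U0001F600b"` holds four UTF-16 code units but decodes to three code points, which is why code that counts or moves by "character" cannot simply index code units.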

 ### Conversion Basics

 A converter is used to transform text from one encoding type to another. In the
 case of Unicode, ICU transforms text from one encoding codepage to Unicode and
 back. An encoding is a mapping from a given character set definition to the
 actual bits used to represent the data.
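As a rough illustration of what a converter does, the sketch below handles one very simple encoding, ISO-8859-1 (Latin-1), in which each byte value equals the corresponding Unicode code point. The function names are hypothetical. This is not the ICU converter API, which supports many encodings and error-handling options that the sketch ignores.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decode ISO-8859-1 bytes to UTF-16.
// Latin-1 maps byte value N to code point U+00NN, so decoding is a widening copy.
std::u16string latin1ToUtf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    assert(out.size() == bytes.size());  // one code unit per input byte
    return out;
}

// Hypothetical helper: encode UTF-16 as ISO-8859-1.
// Returns false if any code unit is above U+00FF and so has no Latin-1 byte.
bool utf16ToLatin1(const std::u16string& text, std::string& out) {
    std::string result;
    result.reserve(text.size());
    for (char16_t c : text) {
        if (c > 0xFF) return false;  // not representable in this encoding
        result.push_back(static_cast<char>(static_cast<unsigned char>(c)));
    }
    out = result;
    return true;
}
```

The reverse direction can fail, which is why real converters need a policy for characters that the target encoding cannot represent.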

 ### Locale and Resources

 The ICU package contains the locale and resource bundles as well as the classes
 that implement them. Also, the ICU package contains the locale data (plain text
 resource bundles) and provides APIs to access and make use of that data in
 various services. Users need to understand these terms and the relationship
 between them.

 A locale identifies a group of users who have similar cultural and linguistic
 expectations for how their computers interact with them and process data. This
 is an abstract concept that is typically expressed by one of the following:

 *   A locale ID specifies a language and region, enabling the software to
     support culturally and linguistically appropriate information for each
     user.

 *   A locale object represents a specific geographical, political, or cultural
     region.

 As a programmatic expression of locale IDs, ICU provides the C++ Locale class.
 In C, Application Programming Interfaces (APIs) use simple C strings for locale
 IDs.
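Because a locale ID in its simplest form is just a language code optionally followed by a region code, a small sketch can show how such a string breaks down. The type `SimpleLocaleId` and the function `splitLocaleId` are hypothetical, and the sketch assumes only the `language` or `language_REGION` shapes. It is not how ICU parses locale IDs, which can carry further parts that are ignored here.

```cpp
#include <cassert>
#include <string>

// Hypothetical type: the two parts of a simple locale ID.
struct SimpleLocaleId {
    std::string language;  // for example "en"
    std::string region;    // for example "US"; empty when the ID has no region
};

// Hypothetical helper: split "language" or "language_REGION" at the first underscore.
// Any further parts of a locale ID are deliberately not handled in this sketch.
SimpleLocaleId splitLocaleId(const std::string& id) {
    SimpleLocaleId parts;
    const std::string::size_type sep = id.find('_');
    if (sep == std::string::npos) {
        parts.language = id;
    } else {
        parts.language = id.substr(0, sep);
        parts.region = id.substr(sep + 1);
    }
    // The parts never contain more characters than the original ID.
    assert(parts.language.size() + parts.region.size() <= id.size());
    return parts;
}
```

With this sketch, `"en_US"` splits into language `en` and region `US`, while `"fr"` yields language `fr` and an empty region.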

 ICU stores locale-specific data in resource bundles, which provide a general
 mechanism to access strings and other objects for ICU services to perform
 according to locale conventions. ICU contains data for its services to support
 many locales. Resource bundles contain the locale data of applications that use
 ICU. In C++, the **ResourceBundle** class provides access to the locale data. In
 C, this feature is provided by the **ures_** interface.

 In addition to storing system-level data in ICU's resource bundles, applications
 typically also need to use resource bundles of their own to store
 locale-dependent application data. ICU provides the generic resource bundle APIs
 to access these bundles and also provides the tools to build them.

 *Display strings, which are displayed to a user of a program, are bundled in a
 separate file instead of being embedded in the lines of the program.*

 ### Locales and Services

 The interaction between locales and services is fundamental to ICU. Please refer
 to the Locales and Services section of the [Locale](locale/index.md) chapter.

 ### Transliteration

 Transliteration was originally designed to convert characters from one script to
 another (for example, from Greek to Latin, or Japanese Katakana to Latin). Now,
 transliteration is a more flexible mechanism that has pre-built transformations
 for case conversions, normalization conversions, the removal of given
 characters, and also for a variety of language and script transliterations.
 Transliterations can be chained together to perform a series of operations and
 each step of the process can use a UnicodeSet to restrict the characters that
 are affected. There are two basic types of transliterators:

 Most natural language transliterators (such as Greek-Latin) are written as
 rule-based transliterators. Transliterators can be written as text files using a
 simple language that is similar to regular expression syntax.
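The ideas of chaining steps and restricting each step to a set of characters can be sketched with a toy model. Everything below is hypothetical (`ToyStep`, `applyStep`, `applyChain`, `toyExample`) and far simpler than ICU's transliterators: each rule maps a single character without context, and a plain `std::set` stands in for the role a UnicodeSet plays as a filter.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of one transliteration step.
struct ToyStep {
    std::map<char32_t, std::u32string> rules;  // single character -> replacement text
    std::set<char32_t> filter;                 // characters the step may touch; empty = all
};

// Apply one step: characters outside the filter, or without a rule, are copied unchanged.
std::u32string applyStep(const ToyStep& step, const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        const bool allowed = step.filter.empty() || step.filter.count(c) > 0;
        const auto rule = step.rules.find(c);
        if (allowed && rule != step.rules.end()) {
            out += rule->second;
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Chain steps: the output of each step is the input of the next.
std::u32string applyChain(const std::vector<ToyStep>& chain, std::u32string text) {
    for (const ToyStep& step : chain) text = applyStep(step, text);
    return text;
}

// Example chain: Greek alpha and beta to Latin, then uppercase restricted to 'a'.
std::u32string toyExample(const std::u32string& in) {
    ToyStep greekToLatin;
    greekToLatin.rules[0x03B1] = U"a";  // GREEK SMALL LETTER ALPHA
    greekToLatin.rules[0x03B2] = U"b";  // GREEK SMALL LETTER BETA
    ToyStep upper;
    upper.rules[U'a'] = U"A";
    upper.rules[U'b'] = U"B";
    upper.filter.insert(U'a');  // only 'a' may be changed by this step
    std::vector<ToyStep> chain;
    chain.push_back(greekToLatin);
    chain.push_back(upper);
    return applyChain(chain, in);
}

// Usage check: alpha and beta become "ab", then only 'a' is uppercased.
void demoToyChain() {
    assert(toyExample(U"\u03B1\u03B2") == U"Ab");
}
```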

 ### Date and Time Classes

 Date and time routines represent a point in time as a number of milliseconds
 since January 1, 1970 (0:00:00.000 UTC), independent of any calendar or time
 zone. Points in time before then are represented as negative numbers.
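The millisecond representation can be worked through by hand. The sketch below, which does not use ICU, converts a UTC date and time in the proleptic Gregorian calendar to milliseconds since the 1970 epoch using a widely used days-from-civil calculation. The function names are hypothetical, and the result is returned as a `double` on the assumption that this mirrors how ICU4C's `UDate` stores the value.

```cpp
#include <cassert>

// Hypothetical helper: days from 1970-01-01 to the given proleptic Gregorian
// date; the result is negative for dates before the epoch.
long long daysFromCivil(long long year, int month, int day) {
    assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
    // Count years from March so that a leap day falls at the end of the year.
    year -= (month <= 2) ? 1 : 0;
    const long long era = (year >= 0 ? year : year - 399) / 400;  // 400-year cycle
    const long long yearOfEra = year - era * 400;                 // 0..399
    const long long dayOfYear =
        (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;  // 0..365
    const long long dayOfEra =
        yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
    return era * 146097 + dayOfEra - 719468;  // shift so that 1970-01-01 is day 0
}

// Hypothetical helper: milliseconds since 1970-01-01 00:00:00.000 UTC.
double utcToEpochMillis(long long year, int month, int day,
                        int hour, int minute, int second, int millis) {
    const long long days = daysFromCivil(year, month, day);
    const long long totalMinutes = (days * 24 + hour) * 60 + minute;
    const long long ms = totalMinutes * 60 * 1000 + second * 1000LL + millis;
    return static_cast<double>(ms);
}
```

Under these assumptions the epoch itself maps to 0, one millisecond before it (1969-12-31 23:59:59.999 UTC) maps to -1, and 2000-01-01 00:00:00.000 UTC maps to 946684800000.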

 ICU provides the following [classes](datetime/index.md) to support calendars and
 time zones:

 *   [Calendar](datetime/calendar/index.md)
     The abstract superclass for extracting calendar-related attributes from a
     Date value.

 *   [GregorianCalendar](datetime/calendar/index.md)
     A concrete class for representing a Gregorian calendar.

 *   [TimeZone](datetime/timezone/index.md)
     An abstract superclass for representing a time zone.

 *   [SimpleTimeZone](datetime/timezone/index.md)
     A concrete class for representing a time zone for use with a Gregorian
     calendar.

 *C classes provide the same functionality as the C++ classes with the exception
 of subclassing.*

 ### Format and Parse

 Formatters translate between non-text data values and textual representations of
 those values. The result is a string of text that represents the internal value.
 A formatter can parse a string and convert a textual representation of some
 value (if it finds one it understands) back into its internal representation.
 For example, when the formatter reads the characters 1, 0, and 3 followed by
 something other than a digit, it produces the value 103 in its internal binary
 representation.

 A formatter takes a value and produces a user-readable string that represents
 that value or takes a string and parses it to produce a value.
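The parsing example above (the characters 1, 0, and 3 followed by a non-digit) can be sketched in a few lines. The function `parseDecimal` is hypothetical and handles only ASCII digits without overflow checks. It is meant to show the idea of a parse position that advances past the consumed text, not how ICU's formatters are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: parse a run of ASCII digits starting at `pos`.
// On success, store the number in `value`, advance `pos` to the first character
// that was not consumed, and return true. On failure, leave both unchanged.
// Overflow is not handled in this sketch.
bool parseDecimal(const std::string& text, std::size_t& pos, long& value) {
    assert(pos <= text.size());
    std::size_t i = pos;
    long result = 0;
    while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
        result = result * 10 + (text[i] - '0');
        ++i;
    }
    if (i == pos) return false;  // no digit found at the starting position
    value = result;
    pos = i;
    return true;
}
```

Parsing `"103 apples"` from position 0 yields the value 103 and leaves the position at index 3, the space that ended the number.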

 ICU provides the following areas and classes for general formatting, formatting
 numbers, formatting dates and times, and formatting messages:

 #### General Formatting

 *   [Format](index.md)
     Format is the abstract superclass of all format classes. It provides the
     basic methods for formatting and parsing numbers, dates, strings, and other
     objects.

 *   [FieldPosition](index.md)
     FieldPosition is a concrete class for holding the field constant and the
     beginning and ending indices for the number and date fields.

 *   [ParsePosition](index.md)
     ParsePosition is a concrete class for holding the parse position in a string
     during parsing.

 *   [Formattable](index.md)
     Formattable objects can be passed to the Format class or its subclasses
     for formatting. The class encapsulates a polymorphic piece of data to be
     formatted and is used with the MessageFormat class. Some formatting
     operations use the Formattable class to produce a single "type" that
     encompasses all formattable values such as a number, date, string, and so
     on.

 #### Formatting Numbers

 *   [NumberFormat](formatparse/numbers/index.md)
     NumberFormat provides the basic fields and methods to format number objects
     and number primitives into localized strings and parse localized strings to
     number objects.

 *   [DecimalFormat](formatparse/numbers/index.md)
     DecimalFormat provides the methods used to format number objects and number
     primitives into localized strings and parse localized strings into number
     objects in base 10.

 *   [DecimalFormatSymbols](formatparse/numbers/index.md)
     DecimalFormatSymbols is a concrete class used by DecimalFormat to access
     localized number strings such as the grouping separators, the decimal
     separator, and the percent sign.

 #### Formatting Dates and Times

 *   [DateFormat](formatparse/datetime/index.md)
     DateFormat provides the basic fields and methods for formatting date objects
     to localized strings and parsing date and time strings to date objects.

 *   [SimpleDateFormat](formatparse/datetime/index.md)
     SimpleDateFormat is a concrete class used to format date objects to
     localized strings and to parse date and time strings to date objects using a
     GregorianCalendar.

 *   [DateFormatSymbols](formatparse/datetime/index.md)
     DateFormatSymbols is a concrete class used to access localized date and time
     formatting strings, such as names of the months, days of the week, and the
     time zone.

 #### Formatting Messages

 *   [MessageFormat](formatparse/messages/index.md)
     MessageFormat is a concrete class used to produce a language-specific user
     message that contains numbers, currency, percentages, date, time, and string
     variables.

 *   [ChoiceFormat](formatparse/messages/index.md)
     ChoiceFormat is a concrete class used to map strings to ranges of numbers
     and to handle plural words and name series in user messages.

 *C classes provide the same functionality as the C++ classes with the exception
 of subclassing.*

 ### Searching and Sorting

 Sorting and searching non-English text presents a number of challenges that many
 English speakers are unaware of. The primary source of difficulty is accents,
 which have very different meanings in different languages, and sometimes even
 within the same language:

 *   Many accented letters, such as the é in café, are treated as minor variants
     on the letter that is accented.

 *   Sometimes the accented form of a letter is treated as a distinct letter for
     the purposes of comparison. For example, Å in Danish is treated as a
     separate letter that sorts after Z, at the end of the alphabet.

 *   In some cases, an accented letter is treated as if it were two letters. In
     traditional German, for example, ä is compared as if it were ae.

 Searching and sorting is done through collation using the Collator class and its
 subclass RuleBasedCollator, together with the CollationElementIterator and
 CollationKey classes. Collation determines the proper sort sequence for two or
 more natural language strings. It can also determine if two strings are
 equivalent for the purpose of searching.

 The Collator class and its subclass RuleBasedCollator perform locale-sensitive
 string comparisons to create sorting and searching routines for natural language
 text. Collator and RuleBasedCollator can distinguish differences in base
 characters (such as 'a' and 'b'), accent marks (such as 'ò' and 'ó'), and
 case (such as 'a' and 'A').
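The distinction between base characters, accents, and case can be illustrated with a toy multi-level comparison. The weights below are invented for this sketch and cover only ASCII letters and four accented letters. All names (`ToyWeights`, `toyWeightsOf`, `levelWeight`, `toyCompare`) are hypothetical, and the code is not how RuleBasedCollator is implemented; it only shows why differences at a lower level matter only when the higher levels are equal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical toy weights: base letter, accent, and case.
struct ToyWeights {
    int primary;    // base letter: a = 1, b = 2, ...
    int secondary;  // accent: 0 = none, 1 = grave, 2 = acute
    int tertiary;   // case: 0 = lowercase, 1 = uppercase
};

// Invented weight table for this sketch only.
ToyWeights toyWeightsOf(char32_t c) {
    if (c >= U'a' && c <= U'z') return {static_cast<int>(c - U'a') + 1, 0, 0};
    if (c >= U'A' && c <= U'Z') return {static_cast<int>(c - U'A') + 1, 0, 1};
    switch (c) {
        case 0x00E8: return {5, 1, 0};   // e with grave
        case 0x00E9: return {5, 2, 0};   // e with acute
        case 0x00F2: return {15, 1, 0};  // o with grave
        case 0x00F3: return {15, 2, 0};  // o with acute
        default:     return {1000 + static_cast<int>(c), 0, 0};  // everything else sorts after the letters
    }
}

// Select the weight for one comparison level (0 = primary, 1 = secondary, 2 = tertiary).
int levelWeight(const ToyWeights& w, int level) {
    assert(level >= 0 && level <= 2);
    return level == 0 ? w.primary : (level == 1 ? w.secondary : w.tertiary);
}

// Compare whole strings level by level; returns -1, 0, or 1.
int toyCompare(const std::u32string& a, const std::u32string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (int level = 0; level < 3; ++level) {
        for (std::size_t i = 0; i < n; ++i) {
            const int x = levelWeight(toyWeightsOf(a[i]), level);
            const int y = levelWeight(toyWeightsOf(b[i]), level);
            if (x != y) return x < y ? -1 : 1;
        }
        // With an equal common prefix, the shorter string sorts first.
        if (a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    }
    return 0;
}
```

With these toy weights, "café" sorts before "cafz" because the base letter e precedes z, even though a plain code point comparison puts é (U+00E9) after z. "cafe" and "café" differ only at the accent level, and "a" and "A" only at the case level.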

 ICU provides the following collation classes for sorting and searching natural
 language text according to locale-specific rules:

 *   [Collator](collation/architecture.md)
     Collator is the abstract base class of all classes that compare strings.

 *   [CollationElementIterator](collation/architecture.md)
     CollationElementIterator is a concrete iterator class that provides an
     iterator for stepping through each character of a locale-specific string
     according to the rules of a specific collator object.

 *   [RuleBasedCollator](collation/architecture.md)
     RuleBasedCollator is the only built-in implementation of the collator. It
     provides a sophisticated mechanism for comparing strings in a
     language-specific manner, and an interface that allows the user to
     specifically customize the sorting order.

 *   [CollationKey](collation/architecture.md)
     CollationKey is an object that enables the fast sorting of strings by
     representing a string as a sort key under the rules of a specific collator
     object.

 *C classes provide the same functionality as the C++ classes with the exception
 of subclassing.*

 ### Text Analysis

 The BreakIterator services can be used for formatting and handling text;
 locating the beginning and ending points of a word; counting words, sentences,
 and paragraphs; and listing unique words. Specifically, locating linguistic
 boundaries supports text operations such as the following:

 *   Display text on the screen and locate places in the text where the
     BreakIterator can perform word-wrapping to fit the text within the margins

 *   Locate the beginning and end of a word that the user has selected

 *   Count graphemes (or characters), words, sentences, or paragraphs

 *   Determine how far to move in the text store when the user hits an arrow key
     to move forward or backward one grapheme

 *   Make a list of all the unique words in a document

 *   Figure out whether or not a range of text contains only whole words

 *   Capitalize the first letter of each word

 *   Extract a particular unit from the text such as "find me the third grapheme
     in this document"

 The BreakIterator services were designed and developed around an "iterator" or
 "cursor" style of interface. The object points to a particular place in the
 text. You can move the pointer forward or backward to search the text for
 boundaries.
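The cursor style of interface can be sketched with a toy class that treats runs of spaces and runs of non-space characters as units. The names `ToyWordBoundaryCursor`, `kDone`, and `toyWordBoundaries` are hypothetical, and the boundary rule is a deliberate oversimplification. Real boundary analysis is locale-sensitive and rule-driven, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel returned when no further boundary exists.
const std::size_t kDone = static_cast<std::size_t>(-1);

// Hypothetical toy cursor: a boundary lies wherever the text switches between
// spaces and non-spaces, plus at the start and end of the text.
class ToyWordBoundaryCursor {
public:
    explicit ToyWordBoundaryCursor(const std::string& text) : text_(text), pos_(0) {}

    // Move to the first boundary (the start of the text).
    std::size_t first() { pos_ = 0; return pos_; }

    // Move forward to the next boundary, or return kDone at the end.
    std::size_t next() {
        assert(pos_ <= text_.size());
        if (pos_ >= text_.size()) return kDone;
        const bool inSpace = text_[pos_] == ' ';
        while (pos_ < text_.size() && (text_[pos_] == ' ') == inSpace) ++pos_;
        return pos_;
    }

    // Move backward to the previous boundary, or return kDone at the start.
    std::size_t previous() {
        if (pos_ == 0) return kDone;
        const bool inSpace = text_[pos_ - 1] == ' ';
        while (pos_ > 0 && (text_[pos_ - 1] == ' ') == inSpace) --pos_;
        return pos_;
    }

private:
    std::string text_;
    std::size_t pos_;
};

// Collect every boundary by walking the cursor forward from the start.
std::vector<std::size_t> toyWordBoundaries(const std::string& text) {
    ToyWordBoundaryCursor cursor(text);
    std::vector<std::size_t> result;
    for (std::size_t b = cursor.first(); b != kDone; b = cursor.next()) {
        result.push_back(b);
    }
    return result;
}
```

For the text `"hi there"` this toy cursor reports boundaries at offsets 0, 2, 3, and 8; calling `previous()` from the end walks the same offsets in reverse.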

 The BreakIterator class makes it possible to iterate over user characters. A
 BreakIterator can find the location of a character, word, sentence or potential
 line-break boundary. This makes it possible for a software program to properly
 select characters for text operations such as highlighting a character, cutting
 a word, moving to the next sentence, or wrapping words at a line ending.
 BreakIterator performs these operations in a locale-sensitive manner, meaning
 that it recognizes text boundaries according to the particular locale ID.

 ICU provides the following classes for iterating over locale-specific text:

 *   [BreakIterator](boundaryanalysis/index.md)
     The abstract base class that defines the operations for finding and getting
     the positions of logical breaks in a string of text: characters, words,
     sentences, and potential line breaks.

 *   [CharacterIterator](strings/characteriterator.md)
     The abstract base class for forward and backward iteration over a string of
     Unicode characters.

 *   [StringCharacterIterator](strings/index.md)
     A concrete class for forward and backward iteration over a string of Unicode
     characters. StringCharacterIterator inherits from CharacterIterator.

 ### Text Layout

 Some scripts require rendering behavior that is more complicated than the Latin
 script. These scripts are called "complex scripts" and their text is called
 "complex text." Examples of complex scripts are the Indic scripts (Devanagari,
 Tamil, Telugu, and Gujarati), Thai, and Arabic.

 Complex text has the following main characteristics:

 In most cases, the contextual and ligature forms of characters have not been
 assigned Unicode codepoints and thus cannot be displayed directly using
 codepoints.

 The ICU LayoutEngine provides a uniform interface for preparing complex text for
 display. The LayoutEngine code is independent of the font and rendering
 architecture of the underlying platform. All access to the LayoutEngine code is
 through an abstract base class. A concrete instance of this base class must be
 implemented for each platform.

 The ICU LayoutEngine prepares complex text using the following procedures:

 ## Locale-Dependent Operations

 Many of the ICU classes are locale-sensitive, meaning that you have to create a
 different one for each locale.

 | C API | C++ Class | Description |
 |----------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | ubrk_ | BreakIterator | The BreakIterator class implements methods to find the location of boundaries in the text. |
 | ucal_ | Calendar | The Calendar class is an abstract base class that converts between a UDate object and a set of integer fields such as YEAR, MONTH, DAY, HOUR, and so on. |
 | umsg.h | ChoiceFormat | A ChoiceFormat class enables you to attach a format to a range of numbers. |
 | ucol_ | CollationElementIterator | The CollationElementIterator class is used as an iterator to walk through each character of an international string. |
 | ucol_ | CollationKey | The Collator class generates the Collation keys. |
 | ucol_ | Collator | The Collator class performs locale-sensitive string comparison. |
 | udat_ | DateFormat | DateFormat is an abstract class for a family of classes. DateFormat converts dates and times from their internal representations to textual form and back again in a language-independent manner. |
 | udat_ | DateFormatSymbols | DateFormatSymbols is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
 | unum_ | DecimalFormatSymbols | This class represents the set of symbols needed by DecimalFormat to format numbers. |
 | umsg.h | Format | The Format class is the base class for all formats. |
 | ucal_ | GregorianCalendar | GregorianCalendar is a concrete class that provides the standard calendar used in many locations. |
 | uloc_ | Locale | A Locale object represents a specific geographical, political, or cultural region. |
 | umsg.h | MessageFormat | MessageFormat provides a means to produce concatenated messages in a language-neutral way. |
 | unum_ | NumberFormat | NumberFormat is an abstract base class for all number formats. |
 | ures_ | ResourceBundle | ResourceBundle provides a means to access a collection of locale-specific information. |
 | ucol_ | RuleBasedCollator | The RuleBasedCollator provides the implementation of the Collator class using data-driven tables. |
 | udat_ | SimpleDateFormat | SimpleDateFormat is a concrete class used to format and parse dates in a language-independent way. |
 | ucal_ | SimpleTimeZone | SimpleTimeZone is a concrete subclass of TimeZone that represents a time zone for use with a Gregorian calendar. |
 | usearch_ | StringSearch | StringSearch provides a way to search text in a locale-sensitive manner. |
 | ucal_ | TimeZone | TimeZone represents a time zone offset, and also determines daylight saving time settings. |

 ## Locale-Independent Operations

 The following ICU services can be used in all locales as they provide
 locale-independent services and users do not need to specify a locale ID:

 | C API | C++ Class | Description |
 |-----------|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | ubidi_ |  | UBiDi is used for implementing the Unicode BiDi algorithm. |
 | utf.h | CharacterIterator | CharacterIterator is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU CharacterIterator classes. |
 | n/a | Formattable | Formattable is a thin wrapper class that converts between the primitive numeric types (double, long, and so on) and the UDate and UnicodeString classes. Formattable objects can be passed to the Format class or its subclasses for formatting. |
 | unorm_ | Normalizer | Normalizer transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
 | n/a | ParsePosition | ParsePosition is a simple class used by the Format class and its subclasses to keep track of the current position during parsing. |
 | uidna_ |  | An implementation of the IDNA protocol as defined in RFC 3490. |
 | utf.h | StringCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UnicodeString. |
 | utf.h | UCharCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UChar array. |
 | uchar.h |  | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
 | uregex_ | RegexMatcher | RegexMatcher is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
 | utrans_ | Transliterator | Transliterator is an abstract class that transliterates text from one format to another. The most common kind of transliterator is a script, or alphabet, transliterator. |
 | uset_ | UnicodeSet | Objects of the UnicodeSet class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
 | ustring.h | UnicodeString | UnicodeString is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class Replaceable. |
 | ushape.h |  | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
 | ucnv_ |  | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |