| --- |
| layout: default |
| title: Internationalization |
| nav_order: 1 |
| parent: ICU |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # Software Internationalization |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Overview of Software Internationalization |
| |
| Developing globalized software is a continuous balancing act as software |
| developers and project managers inadvertently underestimate the level of effort |
| and detail required to create foreign-language software releases. |
| |
| Software developers must understand the ICU services to design and deploy |
| successful software releases. The services can save ICU users time in dealing |
| with the kinds of problems that typically arise during critical stages of the |
| software life cycle. |
| |
| In general, the standard process for creating globalized software includes |
| "internationalization", which covers generic coding and design issues, and |
| "localization", which involves translating and customizing a product for a |
| specific market. |
| |
| Software developers must understand the intricacies of internationalization |
| since they write the actual underlying code. How well they use established |
| services to achieve mission objectives determines the overall success of the |
| project. At a fundamental level, code and feature design affect how a product is |
| translated and customized. Therefore, software developers need to understand key |
| localization concepts. |
| |
| From a geographic perspective, a locale is a place. From a software perspective, |
| a locale is an ID used to select information associated with a language and/or |
| a place. ICU locale information includes the name and identifier of the spoken |
| language, sorting and collating requirements, currency usage, numeric display |
| preferences, and text direction (left-to-right or right-to-left, horizontal or |
| vertical). |
| |
| General locale-sensitive standards include keyboard layouts, default paper and |
| envelope sizes, common printers and monitor resolutions, character sets or |
| encoding ranges, and input methods. |
| |
| ## ICU Services Overview |
| |
| The ICU services support all major locales with language and sub-language pairs. |
| The sub-language generally corresponds to a country. One way to think of this is |
| in terms of the phrase "X language as spoken in Y country." The way people speak |
| or write a particular language might not change dramatically from one country to |
| the next (for example, German is spoken in Austria, Germany, and Switzerland). |
| However, cultural conventions and national standards often differ a great deal. |
| |
| A key advantage to using the ICU services is the net result in reduced time to |
| market. The translation of the display strings is bundled in separate text files |
| for translation. A programmer team with translators no longer needs to search |
| the source code in order to rewrite the software for each country and language. |
| |
| ## Internationalization and Unicode |
| |
| Unicode enables a program to use a standard encoding scheme for all textual data |
| within the program's environment. Conversion has to be done with incoming and |
| outgoing data only. Operations on the text (while it is in the environment) are |
| simplified since you do not have to keep track of the encoding of a particular |
| text. |
| |
| Unicode supports multilingual data since it encodes characters for all world |
| languages. You do not have to tag pieces of data with their encoding to enable |
| the right characters, and you can mix languages within a single piece of text. |
| |
| Some of the advantages of using ICU to internationalize your program include the |
| following: |
| |
| * It can handle text in any language or combination of languages. |
| |
| * The source code can be written so that the program can work for many |
| locales. |
| |
| * Configurable, pluggable localization is enabled. |
| |
| * Multiple locales are supported at the same time. |
| |
| * Non-technical people can be given access to information and you don't have |
| to open the source code to them. |
| |
| * Software can be developed so that the same code can be ported to various |
| platforms. |
| |
| ## Project Management Tips for Internationalizing Software |
| |
| The following two processes are key when managing, developing and designing a |
| successful internationalization software deliverable: |
| |
| 1. Separate the program's executable code from its UI elements. |
| |
| 2. Avoid making cultural assumptions. |
| |
| Keep static information (such as pictures, window layouts) separate from the |
| program code. Also ensure that the text which the program generates on the fly |
| (such as numbers and dates) comes out in the right language. The text must be |
| formatted correctly for the targeted user community. |
| |
| Make sure the analysis and manipulation of both text and kinds of data |
| (such as dates), is done in a manner that can be easily adapted for different |
| languages and user communities. This includes tasks such as alphabetizing lists |
| and looking for line-break positions. |
| |
| Characters must display on the screen correctly (the text's storage format must |
| be translated to the proper visual images). They must also be accepted as input |
| (translated from keystrokes, voice input or another kind of input into the |
| text's storage format). These processes are relatively easy for English, but |
| quite challenging for other languages. |
| |
| ### Separating Executable Code from UI Elements |
| |
| Good software design requires that the programming code implementing the user |
| interface (UI) be kept separate from code implementing the underlying |
| functionality. The description of the UI must also be kept separate from the |
| code implementing it. |
| |
| The description of the UI contains items that the user sees, including the |
| various messages, buttons, and menu commands. It also contains information about |
| how dialog boxes are to be laid out, and how icons, colors or other visual |
| elements are to be used. For example, German words tend to be longer since they |
| contains grammatical suffixes that English has lost in the last 800 years. The |
| following table shows how word lengths can differ among languages. |
| |
| |English|German|Cyrillic-Serbian| |
| |--------|--------|-------------| |
| |cut|ausschneiden|исеци| |
| |copy|kopieren|копирајpasteeinfügenзалепи| |
| |
| The description of the UI, especially user-visible pieces of text, must be kept |
| together and not embedded in the program's executable code. ICU provides the |
| ResourceBundle services for this purpose. |
| |
| ### Avoiding Cultural/Hidden Assumptions |
| |
| Another difficulty encountered when designing and implementing code is to make |
| it flexible enough to handle different ways of doing things in other countries |
| and cultures. Most programmers make unconscious assumptions about their user's |
| language and customs when they design their programs. For example, in Thailand, |
| the official calendar is the Buddhist calendar and not the Gregorian calendar. |
| |
| These assumptions make it difficult to translate the user interface portion of |
| the code for some user communities without rewriting the underlying program. The |
| ICU libraries provide flexible APIs that can be used to perform the most common |
| and important tasks. They contain pre-built supporting data that enables them to |
| work correctly in 75 languages and more than 200 locales. The key is |
| understanding when, where, why, or how to use the APIs effectively. |
| |
| The remainder of this section provides an overview of some cultural and hidden |
| assumptions components. See a list of topics below: |
| * [Numbers and Dates](#numbers-and-dates) |
| * [Messages](#messages) |
| * [Measuring Units](#measuring-units) |
| * [Alphabetical Order of Characters](#alphabetical-order-of-characters) |
| * [Characters](#characters) |
| * [Text Input and Layout](#text-input-and-layout) |
| * [Text Manipulation](#text-manipulation) |
| * [Date/Time Formatting](#datetime-formatting) |
| * [Distributed Locale Support](#distributed-locale-support) |
| * [LayoutEngine](#layoutengine) |
| |
| #### Numbers and Dates |
| |
| Numbers and dates are represented in different languages. Do not implement |
| routines for converting numbers into strings, and do not call low-level system |
| interfaces like `sprintf()` that do not produce language-sensitive results. |
| Instead, see how ICU's [NumberFormat](format_parse/numbers/index.md) and |
| [DateFormat](format_parse/datetime/index.md) services can be used more |
| effectively. |
| |
| #### Messages |
| |
| Be careful when formulating assumptions about how individual pieces of text are |
| used together to create a complete sentence (for example, when error messages |
| are generated). The elements might go together in a different order if the |
| message is translated into a new language. ICU provides |
| [MessageFormat](format_parse/messages/index.md) (§) and |
| [ChoiceFormat](format_parse/messages/index.md) (§) to help with these |
| occurrences. |
| |
| > :point_right: **Note**: *There also might be situations where parts of the sentence change when other |
| parts of the sentence also change (selecting between singular and plural nouns |
| that go after a number is the most common example). * |
| |
| #### Measuring Units |
| |
| Numerical representations can change with regard to measurement units and |
| currency values. Currency values can vary by country. A good example of this is |
| the representation of $1,000 dollars. This amount can represent either U.S. or |
| Canadian dollar values. US dollars can be displayed as USD while Canadian |
| dollars can be displayed as CAD, depending on the locale. In this case, the |
| displayed numerical quantity might change, and the number itself might also |
| change. [NumberFormat](format_parse/numbers/index.md) provides some support for |
| this. |
| |
| #### Alphabetical Order of Characters |
| |
| All languages (even those using the same alphabet) do not necessarily have the |
| same concept of alphabetical order. Do not assume that alphabetical order is the |
| same as the numerical order of the character's code-point values. In practice, |
| 'a' is distinct from 'A' and 'b' is distinct from 'B'. Each has a different code |
| point . This means that you cannot use a bit-wise lexical comparison (such as |
| what strcmp() provides), to sort user-visible lists. |
| |
| Not all languages interpret the same characters as equivalent. If a character's |
| case is changed it is not always a one-to-one mapping. Accent differences, the |
| presence or absence of certain characters, and even spelling differences might |
| be insignificant when determining whether two strings are equal. The |
| [Collator](collation/index.md) services provide significant help in this area. |
| |
| #### Characters |
| |
| A character does not necessarily correspond to a single code-point position in |
| the backing store. All languages might not have the same definition of a word, |
| and might not find that any group of characters separated by a white space is an |
| acceptable approximation for the definition of a word. ICU provides the |
| [BreakIterator](boundaryanalysis/index.md) services to help locate boundaries or |
| when counting units of text. |
| |
| When checking characters for membership in a particular class, do not list the |
| specific characters you are interested in, and do not assume they come in any |
| particular order in the encoding scheme. For example, /A-Za-z/ does not mean all |
| letters in most European languages, and /0-9/ does not mean all digits in many |
| writing systems. This also holds true when using C interfaces such as `isupper()` |
| and `islower()`. ICU provides a large group of utility functions for testing |
| character properties, such as `u_isupper()` and `u_islower()`. |
| |
| #### Text Input and Layout |
| |
| Do not assume anything about how a piece of text might be drawn on the screen, |
| including how much room it takes up, the direction it flows, or where on the |
| screen it should start. All of these text elements vary according to language. |
| As a result, there might not be a one-to-one relationship between characters and |
| keystrokes. One-to-many, many-to-one, and many-to-many relationships between |
| characters and keystrokes all occur in real text in some languages. |
| |
| #### Text Manipulation |
| |
| Do not assume that all textual data, which the program stores and manipulates, |
| is in any particular language or writing system. ICU provides many methods that |
| help with text storage. The `UnicodeString` class and `u_strxxx` functions are |
| provided for Unicode-based character manipulation. For example, when appending |
| an existing Unicode character buffer, characters can be removed or extracted out |
| of the buffer. |
| |
| A good example of text manipulation is the Rosetta stone. The same text is |
| written on it in Hieroglyphic, Greek and Demotic. ICU provides the services to |
| correctly process multi-lingual text such as this correctly. |
| |
| #### Date/Time Formatting |
| |
| Time can be determined in many units, such as the lengths of months or years, |
| which day is the first day of the week, or the allowable range of values like |
| month and year (with `DateFormat`). It can also determine the time zone you are in |
| (with `TimeZone`), or when daylight-savings time starts. ICU provides the Calendar |
| services needed to handle these issues. |
| |
| #### Distributed Locale Support |
| |
| In most server applications, do not assume that all clients connected to the |
| server interact with their users in the same language. Also do not assume that a |
| session stops and restarts whenever a user speaking one language replaces |
| another user speaking a different language. ICU provides sufficient flexibility |
| for a program to handle multiple locales at the same time. |
| |
| For example, a Web server needs to serve pages to different users, languages, |
| and date formats at the same time. |
| |
| #### LayoutEngine |
| |
| The ICU LayoutEngine is an Open Source library that provides a uniform, easy to |
| use interface for preparing complex scripts or text for display. The Latin |
| script, which is the most commonly used script among software developers, is |
| also the least complex script to display especially when it is used to write |
| English. Using the Latin script, characters can be displayed from left to right |
| in the order that they are stored in memory. Some scripts require rendering |
| behavior that is more complicated than the Latin script. We refer to these |
| scripts as "complex scripts" and to text written in these scripts as "complex |
| text." |