| # BiDi Algorithm |
| |
| ## Overview |
| |
| Bidirectional text consists of mainly right-to-left text with some left-to-right |
| nested segments (such as an Arabic text with some information in English), or |
| vice versa (such as an English letter with a Hebrew address nested within it). |
| The predominant direction is called the global orientation. |
| |
| Languages involving bidirectional text are used mainly in the Middle East. They |
| include Arabic, Urdu, Farsi, Hebrew, and Yiddish. |
| |
| In such a language, the general flow of text proceeds horizontally from right to |
| left, but numbers are written from left to right, the same way as they are |
| written in English. In addition, if some text (addresses, acronyms, or |
| quotations) in English or another left-to-right language is embedded, it is also |
| written from left to right. |
| |
| *Libraries that implement a bidirectional algorithm and reorder strings |
| accordingly are sometimes called "Storage Layout Engines". ICU's BiDi (ubidi.h) |
| and shaping (ushape.h) APIs can be used at the core of such "Storage Layout |
| Engines".* |
| |
| ## Countries with Languages that Require Bidirectional Scripting |
| |
| More than 300 million people depend on bidirectional scripts. This figure |
| includes speakers of Farsi and Urdu, which use the Arabic script with |
| additional characters. |
| |
| Language | Number of Countries or Regions |
| |----------|------------------------------------------------------| |
| | Arabic | 18 | |
| | Farsi | 1 (Iran) | |
| | Urdu | 2 (India, Pakistan) | |
| | Hebrew | 1 (Israel) | |
| | Yiddish | Israel, North America, South America, Russia, Europe | |
| |
| |
| ## Logical Order versus Visual Order |
| |
| When reading bidirectional text, whenever the eye of the experienced reader |
| encounters an embedded segment, it "automatically" jumps to the other end of the |
| segment and reads it in the opposite direction. The sequence in which the |
| characters are pronounced is thus a logical sequence which differs from the |
| visual sequence in which they are presented on the screen or page. |
| |
| The logical order of bidirectional text is also the order in which it is usually |
| keyed, and in which it is stored in memory. |
| |
| Consider the following example, where Arabic or Hebrew letters are represented |
| by uppercase English letters and English text is represented by lowercase |
| letters: |
| |
| english CIBARA text |
| |
| The English letter h is visually followed by the Arabic letter C, but logically |
| h is followed by the rightmost letter A. The next letter, in logical order, will |
| be R. In other words, the logical and storage order of the same text would be: |
| |
| english ARABIC text |
| |
| Text is stored and processed in logical order to make processing feasible: a |
| contiguous substring of logical-order text (e.g., from a copy-and-paste operation) |
| contains a logically contiguous piece of the text. For example, "ish ARA" is a |
| logically contiguous piece of the sample text above. By contrast, a contiguous |
| substring of visual-order text may contain pieces of the text from distant parts |
| of a paragraph. ("ish" and "CIB" from the sample text above are not logically |
| adjacent.) Sorting and searching in text (establishing lexical order among |
| strings) as well as any other kind of context-sensitive text analysis also rely |
| on the storage of text in logical order because such processing must match user |
| expectations. |
| |
| When text is displayed or printed, it must be "reordered" into visual order with |
| some parts of the text laid out left-to-right, and other parts laid out |
| right-to-left. The Unicode standard specifies an algorithm for this |
| logical-to-visual reordering. It always works on a paragraph as a whole; the |
| actual positioning of the text on the screen or paper must then take line breaks |
| into account, based on the output of the bidirectional algorithm. The reordering |
| output is also used for cursor movement and selection. |
| |
| Legacy systems frequently stored text in visual order to avoid reordering for |
| display. When exchanging data with such systems for processing in Unicode it is |
| necessary to reorder the data from visual order to logical order and back. Such |
| not-for-display transformations are sometimes referred to as "storage layout" |
| transformations. |
| |
| There are two problems with an "inverse reordering" from visual to logical order: |
| There may be more than one logical order of text that results in the same |
| display (logical-to-visual reordering is a many-to-one function), and there is |
| no standard algorithm for it. ICU's BiDi API provides a setting for "inverse" |
| operation that modifies the standard Unicode Bidi algorithm. However, it may not |
| always produce the expected results. Bidirectional data should be converted to |
| Unicode and reordered to logical order only once to avoid roundtrip losses. Just |
| as it is best to never convert to non-Unicode charsets, data should not be |
| reordered from logical to visual order except for display and printing. |
| |
| ## References |
| |
| ICU provides an implementation of the Unicode BiDi algorithm, as well as simple |
| functions to write a reordered version of the string using the generated |
| metadata. An "inverse" flag can be set to **approximate** visual-to-logical |
| reordering. See the ubidi.h header file and the [BiDi API |
| References](http://icu-project.org/apiref/icu4c/ubidi_8h.html). |
| |
| See [Unicode Standard Annex #9: The Bidirectional |
| Algorithm](http://www.unicode.org/unicode/reports/tr9/). |
| |
| ## Programming Examples in C and C++ |
| |
| See the [BiDi API reference](http://icu-project.org/apiref/icu4c/ubidi_8h.html) |
| for more information. |