BiDi Algorithm

Overview

Bidirectional text consists mainly of right-to-left text with some nested left-to-right segments (such as an Arabic text with some information in English), or vice versa (such as an English letter with a Hebrew address nested within it). The predominant direction is called the global orientation.

Languages involving bidirectional text are used mainly in the Middle East. They include Arabic, Urdu, Farsi, Hebrew, and Yiddish.

In such a language, the general flow of text proceeds horizontally from right to left, but numbers are written from left to right, the same way as they are written in English. In addition, if some text (addresses, acronyms, or quotations) in English or another left-to-right language is embedded, it is also written from left to right.

Libraries that perform a bidirectional algorithm and reorder strings accordingly are sometimes called “Storage Layout Engines”. ICU's BiDi (ubidi.h) and shaping (ushape.h) APIs can be used at the core of such “Storage Layout Engines”.

Countries with Languages that Require Bidirectional Scripting

There are over 300 million people who depend on bidirectional scripts. This includes speakers of Farsi and Urdu, which are written in the Arabic script with additional characters.

Language	Countries
Arabic	18
Farsi	1 (Iran)
Urdu	2 (India, Pakistan)
Hebrew	1 (Israel)
Yiddish	Israel, North America, South America, Russia, Europe

Logical Order versus Visual Order

When reading bidirectional text, whenever the eye of the experienced reader encounters an embedded segment, it “automatically” jumps to the other end of the segment and reads it in the opposite direction. The sequence in which the characters are pronounced is thus a logical sequence which differs from the visual sequence in which they are presented on the screen or page.

The logical order of bidirectional text is also the order in which it is usually keyed, and in which it is stored in memory.

Consider the following example, where Arabic or Hebrew letters are represented by uppercase English letters and English text is represented by lowercase letters:

english CIBARA text

The English letter h is visually followed by the Arabic letter C, but logically h is followed by the rightmost letter A. The next letter, in logical order, will be R. In other words, the logical and storage order of the same text would be:

english ARABIC text
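The relationship between the two strings above can be sketched in a few lines of C. This is only a toy model of the uppercase/lowercase convention used in this section, not the Unicode Bidirectional Algorithm and not ICU's implementation: it assumes a left-to-right paragraph that contains only letters and spaces, and the function name toy_logical_to_visual is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy convention from the text: uppercase stands for a right-to-left
 * letter, lowercase for a left-to-right letter. Anything else (here:
 * spaces) is treated as neutral. */
static int is_rtl(char c) { return isupper((unsigned char)c); }
static int is_ltr(char c) { return islower((unsigned char)c); }

/* Reverse the characters s[start..end) in place. */
static void reverse_range(char *s, size_t start, size_t end) {
    while (start + 1 < end) {
        char tmp = s[start];
        s[start] = s[end - 1];
        s[end - 1] = tmp;
        ++start;
        --end;
    }
}

/* Convert a logical-order string to visual order in place, assuming a
 * left-to-right paragraph. Each right-to-left run is reversed; neutrals
 * between two right-to-left letters belong to the run, while neutrals
 * next to left-to-right text stay where they are. */
void toy_logical_to_visual(char *s) {
    assert(s != NULL);
    size_t n = strlen(s);
    size_t i = 0;
    while (i < n) {
        if (!is_rtl(s[i])) {
            ++i;
            continue;
        }
        size_t start = i;
        size_t end = i + 1; /* one past the last RTL letter of this run */
        for (size_t j = i + 1; j < n && !is_ltr(s[j]); ++j) {
            if (is_rtl(s[j])) {
                end = j + 1;
            }
        }
        reverse_range(s, start, end);
        i = end;
    }
}
```

Calling toy_logical_to_visual on a buffer that holds "english ARABIC text" leaves "english CIBARA text" in the buffer. Real text also needs handling for numbers, punctuation, mirrored characters, and explicit embeddings, which this sketch deliberately ignores.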

Text is stored and processed in logical order to make processing feasible: a contiguous substring of logical-order text (e.g., from a copy-and-paste operation) contains a logically contiguous piece of the text. For example, “ish ARA” is a logically contiguous piece of the sample text above. By contrast, a contiguous substring of visual-order text may contain pieces of the text from distant parts of a paragraph (“ish” and “CIB” from the sample text above are not logically adjacent). Sorting and searching in text (establishing lexical order among strings) as well as any other kind of context-sensitive text analysis also rely on the storage of text in logical order because such processing must match user expectations.
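The non-adjacency of “ish” and “CIB” can be made concrete with an index map from visual positions to logical positions. The following C sketch builds such a map for the toy convention of this section (uppercase for right-to-left letters, lowercase for left-to-right letters). It assumes a left-to-right paragraph containing only letters and spaces; the function name toy_visual_to_logical_map is hypothetical and is not an ICU API.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Fill map[] so that map[v] is the logical index of the character shown
 * at visual position v. The caller provides room for strlen(logical)
 * entries. Toy model: LTR paragraph, letters and spaces only. */
void toy_visual_to_logical_map(const char *logical, size_t *map) {
    size_t n = strlen(logical);
    for (size_t k = 0; k < n; ++k) {
        map[k] = k; /* start from the identity mapping */
    }
    size_t i = 0;
    while (i < n) {
        if (!isupper((unsigned char)logical[i])) {
            ++i;
            continue;
        }
        /* Find the end of this right-to-left run; neutrals between two
         * uppercase letters are part of the run. */
        size_t start = i;
        size_t end = i + 1;
        for (size_t j = i + 1;
             j < n && !islower((unsigned char)logical[j]); ++j) {
            if (isupper((unsigned char)logical[j])) {
                end = j + 1;
            }
        }
        /* Reverse map[start..end): the run is displayed right to left. */
        for (size_t a = start, b = end; a + 1 < b; ++a, --b) {
            size_t tmp = map[a];
            map[a] = map[b - 1];
            map[b - 1] = tmp;
        }
        i = end;
    }
}
```

For "english ARABIC text", visual positions 4 through 10 (the substring “ish CIB” on screen) map to logical indices 4, 5, 6, 7, 13, 12, 11. The jump from 7 to 13 is the gap that makes a visual-order substring logically discontiguous.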

When text is displayed or printed, it must be “reordered” into visual order with some parts of the text laid out left-to-right, and other parts laid out right-to-left. The Unicode standard specifies an algorithm for this logical-to-visual reordering. It always works on a paragraph as a whole; the actual positioning of the text on the screen or paper must then take line breaks into account, based on the output of the bidirectional algorithm. The reordering output is also used for cursor movement and selection.

Legacy systems frequently stored text in visual order to avoid reordering for display. When exchanging data with such systems for processing in Unicode, it is necessary to reorder the data from visual order to logical order and back. Such not-for-display transformations are sometimes referred to as “storage layout” transformations.

There are two problems with an “inverse reordering” from visual to logical order: There may be more than one logical order of text that results in the same display (logical-to-visual reordering is a many-to-one function), and there is no standard algorithm for it. ICU's BiDi API provides a setting for “inverse” operation that modifies the standard Unicode Bidi algorithm. However, it may not always produce the expected results. Bidirectional data should be converted to Unicode and reordered to logical order only once to avoid roundtrip losses. Just as it is best to never convert to non-Unicode charsets, data should not be reordered from logical to visual order except for display and printing.
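One way to see the many-to-one problem is that the paragraph direction is an input to the reordering but is not visible in the reordered characters. The C sketch below extends the toy convention of this section (uppercase for right-to-left letters, lowercase for left-to-right letters) with a paragraph-direction flag. It is an illustration under simplifying assumptions (letters and spaces only, no numbers or explicit embeddings, line alignment ignored), and toy_reorder is a hypothetical name, not an ICU function.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Reverse the characters s[start..end) in place. */
static void reverse_range(char *s, size_t start, size_t end) {
    while (start + 1 < end) {
        char tmp = s[start];
        s[start] = s[end - 1];
        s[end - 1] = tmp;
        ++start;
        --end;
    }
}

/* Letters that run against the paragraph direction. */
static int is_embedded(char c, int para_rtl) {
    return para_rtl ? islower((unsigned char)c) : isupper((unsigned char)c);
}

/* Letters that run with the paragraph direction. */
static int is_base(char c, int para_rtl) {
    return para_rtl ? isupper((unsigned char)c) : islower((unsigned char)c);
}

/* Toy logical-to-visual reordering, in place.
 * para_rtl == 0: left-to-right paragraph; nonzero: right-to-left.
 * For an RTL paragraph the whole line is reversed first, after which the
 * embedded left-to-right runs are reversed back into reading order. */
void toy_reorder(char *s, int para_rtl) {
    size_t n = strlen(s);
    if (para_rtl) {
        reverse_range(s, 0, n);
    }
    size_t i = 0;
    while (i < n) {
        if (!is_embedded(s[i], para_rtl)) {
            ++i;
            continue;
        }
        size_t end = i + 1; /* one past the last embedded letter */
        for (size_t j = i + 1; j < n && !is_base(s[j], para_rtl); ++j) {
            if (is_embedded(s[j], para_rtl)) {
                end = j + 1;
            }
        }
        reverse_range(s, i, end);
        i = end;
    }
}
```

With this sketch, the logical string "def ABC" in a left-to-right paragraph and the logical string "ABC def" in a right-to-left paragraph both reorder to "def CBA". Given only that visual string, an inverse transformation cannot tell which of the two logical strings was intended.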

References

ICU provides an implementation of the Unicode BiDi algorithm, as well as simple functions to write a reordered version of the string using the generated meta-data. An “inverse” flag can be set to approximate visual-to-logical reordering. See the ubidi.h header file and the BiDi API References.

See Unicode Standard Annex #9: The Bidirectional Algorithm.

Programming Examples in C and C++

See the BiDi API reference for more information.