---
layout: default
title: BiDi Algorithm
nav_order: 2
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# BiDi Algorithm
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Overview
Bidirectional text consists mainly of right-to-left text with some nested
left-to-right segments (such as Arabic text with some information in English), or
vice versa (such as an English letter with a Hebrew address nested within it).
The predominant direction is called the global orientation.
Languages involving bidirectional text are used mainly in the Middle East. They
include Arabic, Urdu, Persian, Hebrew, and Yiddish.
In such a language, the general flow of text proceeds horizontally from right to
left, but numbers are written from left to right, the same way as they are
written in English. In addition, if some text (addresses, acronyms, or
quotations) in English or another left-to-right language is embedded, it is also
written from left to right.
*Libraries that perform a bidirectional algorithm and reorder strings
accordingly are sometimes called "Storage Layout Engines". ICU's BiDi (ubidi.h)
and shaping (ushape.h) APIs can be used at the core of such "Storage Layout
Engines".*
## Countries with Languages that Require Bidirectional Scripting
There are over 600 million people whose languages are written right-to-left, including
Persian and Urdu which use the Arabic script with additional characters.
| Language | Countries (examples) |
|----------|------------------------------------------------------|
| Arabic | Egypt, Jordan, Morocco, Saudi Arabia, ... Middle East & North Africa |
| Persian | Iran, Afghanistan |
| Urdu | India, Pakistan |
| Hebrew | Israel |
| Yiddish | Israel, North America, South America, Russia, Europe |
This list of languages is far from complete. Other languages with RTL scripts include
Divehi (Maldives), Kurdish (Iraq), Kashmiri (India), Sindhi (Pakistan and India), Uighur (China), and Pashto (Afghanistan).
## Logical Order versus Visual Order
When reading bidirectional text, whenever the eye of the experienced reader
encounters an embedded segment, it "automatically" jumps to the other end of the
segment and reads it in the opposite direction. The sequence in which the
characters are pronounced is thus a logical sequence which differs from the
visual sequence in which they are presented on the screen or page.
The logical order of bidirectional text is also the order in which it is usually
keyed, and in which it is stored in memory.
Consider the following example, where Arabic or Hebrew letters are represented
by uppercase English letters and English text is represented by lowercase
letters:

    english CIBARA text

The English letter h is visually followed by the Arabic letter C, but logically
h is followed by the rightmost letter A. The next letter, in logical order, will
be R. In other words, the logical and storage order of the same text would be:

    english ARABIC text

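
The relationship between the two strings above can be sketched in a few lines of C. This is only a toy model under the convention of this example (uppercase ASCII stands in for right-to-left letters, the paragraph is left-to-right, and there are no digits or nested levels). It is not the Unicode BiDi algorithm, and the function name `toy_reorder_ltr` is made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Reverses the characters in text[start, end) in place. */
static void reverse_range(char *text, size_t start, size_t end)
{
    while (start + 1 < end) {
        char tmp = text[start];
        text[start] = text[end - 1];
        text[end - 1] = tmp;
        start++;
        end--;
    }
}

/*
 * Toy logical-to-visual reordering for a left-to-right paragraph.
 * Assumption: uppercase ASCII letters play the role of right-to-left
 * letters, so every maximal uppercase run is displayed reversed.
 */
void toy_reorder_ltr(char *text)
{
    assert(text != NULL);
    size_t len = strlen(text);
    size_t i = 0;
    while (i < len) {
        if (isupper((unsigned char)text[i])) {
            size_t start = i;
            while (i < len && isupper((unsigned char)text[i])) {
                i++;
            }
            reverse_range(text, start, i);
        } else {
            i++;
        }
    }
}
```

Calling `toy_reorder_ltr` on a writable copy of `"english ARABIC text"` yields `"english CIBARA text"`, matching the visual order shown above.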
Text is stored and processed in logical order to make processing feasible: A
contiguous substring of logical-order text (e.g., from a copy-and-paste operation)
contains a logically contiguous piece of the text. For example, "ish ARA" is a
logically contiguous piece of the sample text above. By contrast, a contiguous
substring of visual-order text may contain pieces of the text from distant parts
of a paragraph. ("ish" and "CIB" from the sample text above are not logically
adjacent.) Sorting and searching in text (establishing lexical order among
strings) as well as any other kind of context-sensitive text analysis also rely
on the storage of text in logical order because such processing must match user
expectations.
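
The contiguity argument can be checked mechanically with a visual-to-logical index map. The sketch below builds such a map for the same toy convention (uppercase stands in for right-to-left letters, left-to-right paragraph) and tests whether a visual span covers one unbroken logical range. Both function names are hypothetical. ICU's ubidi.h offers real index maps (for example `ubidi_getVisualMap`), which this sketch does not use.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model: fills map[v] with the logical index of the character shown
 * at visual position v. Assumes a left-to-right paragraph in which each
 * uppercase run is displayed mirrored. map needs strlen(logical) slots.
 */
void toy_visual_to_logical(const char *logical, size_t *map)
{
    assert(logical != NULL && map != NULL);
    size_t len = strlen(logical);
    size_t i = 0;
    while (i < len) {
        if (isupper((unsigned char)logical[i])) {
            size_t start = i;
            while (i < len && isupper((unsigned char)logical[i])) {
                i++;
            }
            /* The run keeps visual positions [start, i), mirrored. */
            for (size_t v = start; v < i; v++) {
                map[v] = start + (i - 1 - v);
            }
        } else {
            map[i] = i;
            i++;
        }
    }
}

/* Returns 1 if visual span [from, to) maps to one unbroken logical range. */
int toy_span_is_logically_contiguous(const size_t *map, size_t from, size_t to)
{
    assert(map != NULL && from < to);
    size_t lo = map[from];
    size_t hi = map[from];
    for (size_t v = from + 1; v < to; v++) {
        if (map[v] < lo) lo = map[v];
        if (map[v] > hi) hi = map[v];
    }
    /* Map entries are distinct, so the range is unbroken iff sizes match. */
    return (hi - lo + 1) == (to - from);
}
```

For `"english ARABIC text"`, the visual span holding "ish CIB" (positions 4 through 10) is reported as not contiguous, while the span holding the whole word "CIBARA" (positions 8 through 13) is.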
When text is displayed or printed, it must be "reordered" into visual order with
some parts of the text laid out left-to-right, and other parts laid out
right-to-left. The Unicode standard specifies an algorithm for this
logical-to-visual reordering. It always works on a paragraph as a whole; the
actual positioning of the text on the screen or paper must then take line breaks
into account, based on the output of the bidirectional algorithm. The reordering
output is also used for cursor movement and selection.
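
To illustrate why line breaks have to be applied to the paragraph-level result rather than to the already reordered text, here is a minimal sketch that resolves toy levels for the whole paragraph and then reorders each line separately. The level rules are a drastic simplification (left-to-right paragraph, uppercase stands in for right-to-left letters, single spaces only) and the function names are hypothetical. In ICU the corresponding step is `ubidi_setLine`, which is not used here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy levels for a left-to-right paragraph: uppercase letters get level 1,
 * a single space between two uppercase letters gets level 1, and every
 * other character gets level 0. levels needs strlen(logical) slots.
 */
void toy_levels_ltr(const char *logical, unsigned char *levels)
{
    assert(logical != NULL && levels != NULL);
    size_t len = strlen(logical);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)logical[i];
        if (isupper(c)) {
            levels[i] = 1;
        } else if (c == ' ' && i > 0 && i + 1 < len &&
                   isupper((unsigned char)logical[i - 1]) &&
                   isupper((unsigned char)logical[i + 1])) {
            levels[i] = 1;
        } else {
            levels[i] = 0;
        }
    }
}

/*
 * Copies the line logical[start, limit) into out in visual order, using
 * levels that were resolved for the whole paragraph. out needs
 * (limit - start + 1) bytes.
 */
void toy_reorder_line(const char *logical, const unsigned char *levels,
                      size_t start, size_t limit, char *out)
{
    assert(start <= limit);
    size_t n = limit - start;
    memcpy(out, logical + start, n);
    out[n] = '\0';
    size_t i = 0;
    while (i < n) {
        if (levels[start + i] == 1) {
            size_t run = i;
            while (i < n && levels[start + i] == 1) {
                i++;
            }
            /* Mirror the level-1 run within this line only. */
            for (size_t a = run, b = i; a + 1 < b; a++, b--) {
                char tmp = out[a];
                out[a] = out[b - 1];
                out[b - 1] = tmp;
            }
        } else {
            i++;
        }
    }
}
```

With the logical text `"abc DEF GHI"`, the unbroken paragraph displays as `"abc IHG FED"`. If the line breaks after "DEF", the two lines display as `"abc FED"` and `"IHG"`. Cutting the reordered paragraph string at the same width would instead put "IHG" on the first line, although it comes later in reading order.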
Legacy systems frequently stored text in visual order to avoid reordering for
display. When exchanging data with such systems for processing in Unicode it is
necessary to reorder the data from visual order to logical order and back. Such
not-for-display transformations are sometimes referred to as "storage layout"
transformations.
There are two problems with an "inverse reordering" from visual to logical order:
There may be more than one logical order of text that results in the same
display (logical-to-visual reordering is a many-to-one function), and there is
no standard algorithm for it. ICU's BiDi API provides a setting for "inverse"
operation that modifies the standard Unicode Bidi algorithm. However, it may not
always produce the expected results. Bidirectional data should be converted to
Unicode and reordered to logical order only once to avoid roundtrip losses. Just
as it is best to never convert to non-Unicode charsets, data should not be
reordered from logical to visual order except for display and printing.
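
The many-to-one property can be seen even in a very reduced model. The sketch below assigns toy embedding levels to letters and spaces and then reverses runs from the highest level down, which loosely follows the shape of the standard algorithm but omits numbers, explicit embeddings, brackets, mirroring, and everything else in UAX #9. The name `toy_reorder` and the 64-character limit are assumptions made for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_LEN 64

/* 'R' for uppercase (stand-in for RTL letters), 'L' for lowercase, else 'N'. */
static char toy_class(char c)
{
    if (isupper((unsigned char)c)) return 'R';
    if (islower((unsigned char)c)) return 'L';
    return 'N';
}

/*
 * Toy reordering: writes the visual order of logical into out.
 * base_level is 0 for a left-to-right paragraph, 1 for right-to-left.
 * Returns 0 on success, -1 on bad arguments. out needs strlen + 1 bytes.
 */
int toy_reorder(const char *logical, int base_level, char *out)
{
    assert(logical != NULL && out != NULL);
    size_t len = strlen(logical);
    if (len > TOY_MAX_LEN || (base_level != 0 && base_level != 1)) return -1;

    char base_class = (base_level == 0) ? 'L' : 'R';
    unsigned char level_l = (base_level == 0) ? 0 : 2;
    unsigned char level_r = 1;
    unsigned char levels[TOY_MAX_LEN];

    size_t i = 0;
    while (i < len) {
        char cls = toy_class(logical[i]);
        if (cls != 'N') {
            levels[i] = (cls == 'L') ? level_l : level_r;
            i++;
            continue;
        }
        /* Neutral run: follow the neighbors if they agree, else the base. */
        size_t start = i;
        while (i < len && toy_class(logical[i]) == 'N') i++;
        char before = (start > 0) ? toy_class(logical[start - 1]) : base_class;
        char after = (i < len) ? toy_class(logical[i]) : base_class;
        unsigned char lvl = (unsigned char)base_level;
        if (before == after) lvl = (before == 'L') ? level_l : level_r;
        for (size_t k = start; k < i; k++) levels[k] = lvl;
    }

    memcpy(out, logical, len + 1);
    /* Reverse runs at level >= lvl, from the highest level down to 1. */
    for (int lvl = 2; lvl >= 1; lvl--) {
        i = 0;
        while (i < len) {
            if (levels[i] < lvl) { i++; continue; }
            size_t start = i;
            while (i < len && levels[i] >= lvl) i++;
            for (size_t a = start, b = i; a + 1 < b; a++, b--) {
                char tmp = out[a];
                out[a] = out[b - 1];
                out[b - 1] = tmp;
            }
        }
    }
    return 0;
}
```

In this model, the logical text `"ABC def"` in a left-to-right paragraph and the logical text `"def ABC"` in a right-to-left paragraph both display as `"CBA def"`. A visual-to-logical conversion that sees only `"CBA def"` has to guess which of the two was meant.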
## References
ICU provides an implementation of the Unicode BiDi algorithm, as well as simple
functions to write a reordered version of the string using the generated
meta-data. An "inverse" flag can be set to **approximate** visual-to-logical
reordering. See the ubidi.h header file and the [BiDi API
References](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html).
See [Unicode Standard Annex #9: The Bidirectional
Algorithm](http://www.unicode.org/reports/tr9/).
## Programming Examples in C and C++
See the [BiDi API reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html)
for more information.