 ---
 layout: default
 title: BiDi Algorithm
 nav_order: 2
 parent: Transforms
 ---
 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # BiDi Algorithm
 {: .no_toc }

 ## Contents
 {: .no_toc .text-delta }

 1. TOC
 {:toc}

 ---

 ## Overview

 Bidirectional text consists mainly of right-to-left text with some nested
 left-to-right segments (such as an Arabic text with some information in
 English), or vice versa (such as an English letter with a Hebrew address nested
 within it). The predominant direction is called the global orientation.

 Languages involving bidirectional text are used mainly in the Middle East. They
 include Arabic, Urdu, Persian, Hebrew, and Yiddish.

 In such a language, the general flow of text proceeds horizontally from right to
 left, but numbers are written from left to right, the same way as they are
 written in English. In addition, if some text (addresses, acronyms, or
 quotations) in English or another left-to-right language is embedded, it is also
 written from left to right.

 *Libraries that perform a bidirectional algorithm and reorder strings
 accordingly are sometimes called "Storage Layout Engines". ICU's BiDi (ubidi.h)
 and shaping (ushape.h) APIs can be used at the core of such "Storage Layout
 Engines".*

 ## Countries with Languages that Require Bidirectional Scripting

 There are over 600 million people whose languages are written right-to-left, including
 Persian and Urdu, which use the Arabic script with additional characters.

 | Language | Countries (examples) |
 |----------|------------------------------------------------------|
 | Arabic   | Egypt, Jordan, Morocco, Saudi Arabia, ... Middle East & North Africa |
 | Persian  | Iran, Afghanistan |
 | Urdu     | India, Pakistan |
 | Hebrew   | Israel |
 | Yiddish  | Israel, North America, South America, Russia, Europe |

 This list of languages is far from complete. Other languages with RTL scripts include
 Divehi (Maldives), Kurdish (Iraq), Kashmiri (India), Sindhi (Pakistan and India), Uighur (China), and Pashto (Afghanistan).

 ## Logical Order versus Visual Order

 When reading bidirectional text, whenever the eye of the experienced reader
 encounters an embedded segment, it "automatically" jumps to the other end of the
 segment and reads it in the opposite direction. The sequence in which the
 characters are pronounced is thus a logical sequence which differs from the
 visual sequence in which they are presented on the screen or page.

 The logical order of bidirectional text is also the order in which it is usually
 keyed, and in which it is stored in memory.

 Consider the following example, where Arabic or Hebrew letters are represented
 by uppercase English letters and English text is represented by lowercase
 letters:

     english CIBARA text

 The English letter h is visually followed by the Arabic letter C, but logically
 h is followed by the rightmost letter A. The next letter, in logical order, will
 be R. In other words, the logical and storage order of the same text would be:

     english ARABIC text

 Text is stored and processed in logical order to make processing feasible: A
 contiguous substring of logical-order text (e.g., from a copy-and-paste operation)
 contains a logically contiguous piece of the text. For example, "ish ARA" is a
 logically contiguous piece of the sample text above. By contrast, a contiguous
 substring of visual-order text may contain pieces of the text from distant parts
 of a paragraph. ("ish" and "CIB" from the sample text above are not logically
 adjacent.) Sorting and searching in text (establishing lexical order among
 strings) as well as any other kind of context-sensitive text analysis also rely
 on the storage of text in logical order because such processing must match user
 expectations.

 When text is displayed or printed, it must be "reordered" into visual order with
 some parts of the text laid out left-to-right, and other parts laid out
 right-to-left. The Unicode standard specifies an algorithm for this
 logical-to-visual reordering. It always works on a paragraph as a whole; the
 actual positioning of the text on the screen or paper must then take line breaks
 into account, based on the output of the bidirectional algorithm. The reordering
 output is also used for cursor movement and selection.
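
As a rough illustration of logical-to-visual reordering, the following C sketch applies the uppercase-for-Arabic convention from the sample text above. It is a toy model and not the Unicode Bidirectional Algorithm: the function name `toy_logical_to_visual` and the run rule (uppercase letters, plus any spaces between them, form one right-to-left run inside a left-to-right paragraph) are assumptions made only for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: uppercase ASCII letters stand in for right-to-left
 * letters, as in the sample text. Digits, punctuation, explicit
 * directional controls, nesting, and right-to-left paragraphs are
 * all ignored. */
static int toy_is_rtl(char c) {
    return isupper((unsigned char)c) != 0;
}

/* Copies logical-order `src` into `dst` in visual order for a
 * left-to-right paragraph. `dst` must not alias `src` and must have
 * room for strlen(src) + 1 bytes. */
void toy_logical_to_visual(const char *src, char *dst) {
    assert(src != NULL && dst != NULL && src != dst);
    size_t n = strlen(src);
    size_t i = 0;
    while (i < n) {
        if (!toy_is_rtl(src[i])) {
            dst[i] = src[i];            /* left-to-right text stays put */
            ++i;
            continue;
        }
        /* Find the last character of the right-to-left run. Spaces
         * between two uppercase letters join the run; trailing spaces
         * do not. */
        size_t last = i;
        for (size_t j = i + 1; j < n; ++j) {
            if (toy_is_rtl(src[j])) {
                last = j;
            } else if (src[j] != ' ') {
                break;
            }
        }
        /* Write the run [i, last] mirrored. */
        for (size_t k = i; k <= last; ++k) {
            dst[k] = src[last - (k - i)];
        }
        i = last + 1;
    }
    dst[n] = '\0';
}
```

Calling `toy_logical_to_visual("english ARABIC text", out)` fills `out` with `english CIBARA text`. In this toy model, applying the function to its own output restores the input; real text with digits, punctuation, and explicit directional controls does not behave this simply.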

 Legacy systems frequently stored text in visual order to avoid reordering for
 display. When exchanging data with such systems for processing in Unicode, it is
 necessary to reorder the data from visual order to logical order and back. Such
 not-for-display transformations are sometimes referred to as "storage layout"
 transformations.

 There are two problems with an "inverse reordering" from visual to logical order:
 There may be more than one logical order of text that results in the same
 display (logical-to-visual reordering is a many-to-one function), and there is
 no standard algorithm for it. ICU's BiDi API provides a setting for "inverse"
 operation that modifies the standard Unicode Bidi algorithm. However, it may not
 always produce the expected results. Bidirectional data should be converted to
 Unicode and reordered to logical order only once to avoid roundtrip losses. Just
 as it is best to never convert to non-Unicode charsets, data should not be
 reordered from logical to visual order except for display and printing.

 ## References

 ICU provides an implementation of the Unicode BiDi algorithm, as well as simple
 functions to write a reordered version of the string using the generated
 meta-data. An "inverse" flag can be set to **approximate** visual-to-logical
 reordering. See the ubidi.h header file and the [BiDi API
 References](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html).

 See [Unicode Standard Annex #9: The Bidirectional
 Algorithm](http://www.unicode.org/reports/tr9/).

 ## Programming Examples in C and C++

 See the [BiDi API reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html)
 for more information.