docs/userguide/boundaryanalysis/index.md - external/github.com/unicode-org/icu - Git at Google

 # Boundary Analysis

 ## Overview of Text Boundary Analysis

 Text boundary analysis is the process of locating linguistic boundaries while
 formatting and handling text. Examples of this process include:

 1. Locating appropriate points to word-wrap text to fit within specific margins
    while displaying or printing.

 2. Locating the beginning of a word that the user has selected.

 3. Counting characters, words, sentences, or paragraphs.

 4. Determining how far to move the text cursor when the user hits an arrow key
     (Some characters require more than one position in the text store and some
     characters in the text store do not display at all).

 5. Making a list of the unique words in a document.

 6. Figuring out if a given range of text contains only whole words.

 7. Capitalizing the first letter of each word.

 8. Locating a particular unit of the text (For example, finding the third word
     in the document).

 The `BreakIterator` classes were designed to support these kinds of tasks. The
 BreakIterator objects maintain a location between two characters in the text.
 This location will always be a text boundary. Clients can move the location
 forward to the next boundary or backward to the previous boundary. Clients can
 also check if a particular location within a source text is on a boundary or
 find the boundary which is before or after a particular location.

 ## Four Types of BreakIterator

 ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:

 1. Character Boundary

 2. Word Boundary

 3. Line-break Boundary

 4. Sentence Boundary

 Each type of boundary is found in accordance with the rules specified by Unicode
 Standard Annex #29, *Unicode Text Segmentation*
 (<https://unicode.org/reports/tr29/> ) or Unicode Standard Annex #14, *Unicode
 Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>)

 ### Character Boundary

 The character-boundary iterator locates the boundaries according to the rules
 defined in <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>.
 These boundaries try to match what a user would think of as a "character"—a
 basic unit of a writing system for a language—which may be more than just a
 single Unicode code point.

 The letter `Ä`, for example, can be represented in Unicode either with a single
 code-point value or with two code-point values (one representing the `A` and
 another representing the umlaut `¨`). The character-boundary iterator will treat
 either representation as a single character.

 End-user characters, as described above, are also called grapheme clusters, in
 an attempt to limit the confusion caused by multiple meanings for the word
 "character".

 ### Word Boundary

 The word-boundary iterator locates the boundaries of words, for purposes such as
 double click selection or "Find whole words" operations.

 Words boundaries are identified according to the rules in
 <https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word
 dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for
 locating word breaks take into account the alphabets and conventions used by
 different languages.

 Here's an example of a sentence, showing the boundary locations that will be
 identified by a word break iterator:

 > :point_right: **Note**: TODO: An example needs to be added here.

 ### Line-break Boundary

 The line-break iterator locates positions that would be appropriate points to
 wrap lines when displaying the text. The boundary rules are define here:
 <https://www.unicode.org/reports/tr14/>

 This example shows the differences in the break locations produced by word and
 line break iterators:

 > :point_right: **Note**: TODO: An example needs to be added here.

 ### Sentence Boundary

 A sentence-break iterator locates sentence boundaries according to the rules
 defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries>

 ## Dictionary-Based BreakIterator

 Some languages are written without spaces, and word and line breaking requires
 more than rules over character sequences. ICU provides dictionary support for
 word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

 Use of the dictionaries is automatic when text in one of the dictionary
 languages is encountered. There is no separate API, and no extra programming
 steps required by applications making use of the dictionaries.

 ## Usage

 To locate boundaries in a document, create a BreakIterator using the
 `BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()`
 function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`,
 depending on the type of iterator wanted. These factory methods also take a
 parameter that specifies the locale for the language of the text to be processed.

 When creating a `BreakIterator`, a locale is also specified, and the behavior of
 the BreakIterator obtained may be specialized in some way for that locale. For
 most locales the default break iterator behavior is used.

 Applications also may register customized BreakIterators for use in specific
 locales. Once such a break iterator has been registered, any requests for break
 iterators for that locale will return copies of the registered break iterator.

 ICU may cache service instances. Therefore, registration should be done during
 startup, before opening services by locale ID.

 In the general-usage-model, applications will use the following basic steps to
 analyze a piece of text for boundaries:

 1. Create a `BreakIterator` with the desired behavior

 2. Use the `setText()` method to set the iterator to analyze a particular piece
    of text.

 3. Locate the desired boundaries using the appropriate combination of `first()`,
    `last()`, `next()`, `previous()`, `preceding()`, and `following()` methods.

 The `setText()` method can be called more than once, allowing reuse of a
 BreakIterator on new pieces of text. Because the creation of a `BreakIterator` can
 be relatively time-consuming, it makes good sense to reuse them when practical.

 The iterator always points to a boundary position between two characters. The
 numerical value of the position, as returned by `current()` is the zero-based
 index of the character following the boundary. Thus a position of zero
 represents a boundary preceding the first character of the text, and a position
 of one represents a boundary between the first and second characters.

 The `first()` and `last()` methods reset the iterator's current position to the
 beginning or end of the text (the beginning and the end are always considered
 boundaries). The `next()` and `previous()` methods advance the iterator one boundary
 forward or backward from the current position. If the `next()` or `previous()`
 methods run off the beginning or end of the text, it returns DONE. The `current()`
 method returns the current position.

 The `following()` and `preceding()` methods are used for random access, to move the
 iterator to an arbitrary position within the text. Since a BreakIterator always
 points to a boundary position, the `following()` and `preceding()` methods will
 never set the iterator to point to the position specified by the caller (even if
 it is, in fact, a boundary position). `BreakIterator` will, however, set the
 iterator to the nearest boundary position before or after the specified
 position.

 `isBoundary()` returns true if the specified position is a boundary.

 ### Thread Safety

 `BreakIterator`s are not thread safe. This is inherit in their design—break
 iterators are stateful, holding a reference to and position in the text, meaning
 that a single instance cannot operate in parallel on multiple texts.

 For concurrent break iteration, each thread must use its own break iterator.
 These can be obtained by creating separate break iterators of the desired type,
 or by initially creating a master break iterator and then creating a clone for
 each thread.

 ### Line Breaking Strictness, a CSS Property

 CSS has the concept of "[Line Breaking
 Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
 property specifies the strictness of line-breaking rules applied within an
 element: especially how wrapping interacts with punctuation and symbols. ICU
 line break iterators can choose a strictness using locale tags:

 | Locale       | Behavior    |
 | ------------ | ----------- |
 | `en@lb=strict` <br/> `ja@lb=strict`  | Breaks text using the most stringent set of line-breaking rules |
 | `en@lb=normal` <br/> `ja@lb=normal`  | Breaks text using the most common set of line-breaking rules. |
 | `en@lb=loose`  <br/> `ja@lb=loose`   | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. |

 ### Sentence Break Filters

 Sentence breaking can return false positives - an indication that sentence ends
 in an incorrect position - in the presence of abbreviations. For example,
 consider the sentence

 > In the meantime Mr. Weston arrived with his small ship.

 Default sentence break shows a false boundary following the "Mr."

 ICU includes lists of common abbreviations that can be used to filter, to
 ignore, these false sentence boundaries. Filtering is enabled by the presence of
 the `ss` locale tag when creating the break iterator.

 | Locale           | Behavior                                                |
 | ---------------- | ------------------------------------------------------- |
 | `en`             |  no filtering                                           |
 | `en@ss=standard` |  Filter based on common English language abbreviations. |
 | `es@ss=standard` |  Filter with common Spanish abbreviations.              |

 Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
 French, Italian and Portuguese.

 ## Accuracy

 ICU's break iterators are based on the default boundary rules described in the
 Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
 [29](https://www.unicode.org/unicode/reports/tr29/) . These are relatively
 simple boundary rules that can be implemented efficiently, and are sufficient
 for many purposes and languages. However, some languages and applications will
 require a more sophisticated linguistic analysis of the text in order to find
 boundaries with good accuracy. Such an analysis is not directly available from
 ICU at this time.

 Break Iterators based on custom, user-supplied boundary rules can be created and
 used by applications with requirements that are not met by the standard default
 boundary rules.

 ## BreakIterator Boundary Analysis Examples

 ### Print out all the word-boundary positions in a UnicodeString

 **In C++:**

 ```c++
 void listWordBoundaries(const UnicodeString& s) {
     UErrorCode status = U_ZERO_ERROR;
     BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
     bi->setText(s);
     int32_t p = bi->first();
     while (p != BreakIterator::DONE) {
         printf("Boundary at position %d\n", p);
         p = bi->next();
     }
     delete bi;
 }
 ```

 **In C:**

 ```c
 void listWordBoundaries(const UChar* s, int32_t len) {
     UBreakIterator* bi;
     int32_t p;
     UErrorCode err = U_ZERO_ERROR;
     bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
     if (U_FAILURE(err)) return;
     p = ubrk_first(bi);
     while (p != UBRK_DONE) {
         printf("Boundary at position %d\n", p);
         p = ubrk_next(bi);
     }
     ubrk_close(bi);
 }
 ```

 ### Get the boundaries of the word that contains a double-click position

 **In C++:**

 ```c++
 void wordContaining(BreakIterator& wordBrk,
         int32_t idx,
         const UnicodeString& s,
         int32_t& start,
         int32_t& end) {
     // this function is written to assume that we have an
     // appropriate BreakIterator stored in an object or a
     // global variable somewhere-- When possible, programmers
     // should avoid having the create() and delete calls in
     // a function of this nature.
     if (s.isEmpty())
         return;
     wordBrk.setText(s);
     start = wordBrk.preceding(idx + 1);
     end = wordBrk.next();
     // NOTE: for this and similar operations, use preceding() and next()
     // as shown here, not following() and previous(). preceding() is
     // faster than following() and next() is faster than previous()
     // NOTE: By using preceding(idx + 1) above, we're adopting the convention
     // that if the double-click comes right on top of a word boundary, it
     // selects the word that _begins_ on that boundary (preceding(idx) would
     // instead select the word that _ends_ on that boundary).
 }
 ```

 **In C:**

 ```c
 void wordContaining(UBreakIterator* wordBrk,
     int32_t idx,
     const UChar* s,
     int32_t sLen,
     int32_t* start,
     int32_t* end,
     UErrorCode* err) {
     if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
         *err = U_ILLEGAL_ARGUMENT_ERROR;
         return;
     }
     ubrk_setText(wordBrk, s, sLen, err);
     if (U_SUCCESS(*err)) {
         *start = ubrk_preceding(wordBrk, idx + 1);
         *end = ubrk_next(wordBrk);
     }
 }
 ```

 ### Check for Whole Words

 Use the following to check if a range of text is a "whole word":

 **In C++:**

 ```c++
 UBool isWholeWord(BreakIterator& wordBrk,
     const UnicodeString& s,
     int32_t start,
     int32_t end) {
     if (s.isEmpty())
         return FALSE;
     wordBrk.setText(s);
     if (!wordBrk.isBoundary(start))
         return FALSE;
     return wordBrk.isBoundary(end);
 }
 ```

 **In C:**

 ```c
 UBool isWholeWord(UBreakIterator* wordBrk,
     const UChar* s,
     int32_t sLen,
     int32_t start,
     int32_t end,
     UErrorCode* err) {
     UBool result = FALSE;
     if (wordBrk == NULL || s == NULL) {
         *err = U_ILLEGAL_ARGUMENT_ERROR;
         return FALSE;
     }
     ubrk_setText(wordBrk, s, sLen, err);
     if (U_SUCCESS(*err)) {
         result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end);
     }
     return result;
 }
 ```

 Count the words in a document (C++ only):

 ```c++
 int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) {
     bi.setText(s);
     int32_t count = 0;
     while (start != BreakIterator::DONE) {
         int breakType = bi.getRuleStatus();
         if (breakType != UBRK_WORD_NONE) {
             // Exclude spaces, punctuation, and the like.
             // A status value UBRK_WORD_NONE indicates that the boundary does
             // not start a word or number.
             //
             ++count;
         }
         start = bi.next();
     }
     return count;
 }
 ```

 The function `getRuleStatus()` returns an enum giving additional information on
 the text preceding the last break position found. Using this value, it is
 possible to distinguish between numbers, words, words containing kana
 characters, words containing ideographic characters, and non-word characters,
 such as spaces or punctuation. The sample uses the break status value to filter
 out, and not count, boundaries associated with non-word characters.

 ### Word-wrap a document (C++ only)

 The sample function below wraps a paragraph so that each line is less than or
 equal to 72 characters. The function fills in an array passed in by the caller
 with the starting offsets of
 each line in the document. Also, it fills in a second array to track how many
 trailing white space characters there are in the line. For simplicity, it is
 assumed that an outside process has already broken the document into paragraphs.
 For example, it is assumed that every string the function is passed has a single
 newline at the end only.

 ```c++
 int32_t wrapParagraph(const UnicodeString& s,
                    const Locale& locale,
                    int32_t lineStarts[],
                    int32_t trailingwhitespace[],
                    int32_t maxLines,
                    UErrorCode &status) {

     int32_t        numLines = 0;
     int32_t        p, q;
     const int32_t MAX_CHARS_PER_LINE = 72;
     UChar          c;

     BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
     if (U_FAILURE(status)) {
         delete bi;
         return 0;
     }
     bi->setText(s);


     p = 0;
     while (p < s.length()) {
         // jump ahead in the paragraph by the maximum number of
         // characters that will fit
         q = p + MAX_CHARS_PER_LINE;

         // if this puts us on a white space character, a control character
         // (which includes newlines), or a non-spacing mark, seek forward
         // and stop on the next character that is not any of these things
         // since none of these characters will be visible at the end of a
         // line, we can ignore them for the purposes of figuring out how
         // many characters will fit on the line)
         if (q < s.length()) {
             c = s[q];
             while (q < s.length()
                    && (u_isspace(c)
                        || u_charType(c) == U_CONTROL_CHAR
                        || u_charType(c) == U_NON_SPACING_MARK
             )) {
                 ++q;
                 c = s[q];
             }
         }

         // then locate the last legal line-break decision at or before
         // the current position ("at or before" is what causes the "+ 1")
         q = bi->preceding(q + 1);

         // if this causes us to wind back to where we started, then the
         // line has no legal line-break positions. Break the line at
         // the maximum number of characters
         if (q == p) {
             p += MAX_CHARS_PER_LINE;
             lineStarts[numLines] = p;
             trailingwhitespace[numLines] = 0;
             ++numLines;
         }
         // otherwise, we got a good line-break position. Record the start of this
         // line (p) and then seek back from the end of this line (q) until you find
         // a non-white space character (same criteria as above) and
         // record the number of white space characters at the end of the
         // line in the other results array
         else {
             lineStarts[numLines] = p;
             int32_t nextLineStart = q;

             for (q--; q > p; q--) {
                 c = s[q];
                 if (!(u_isspace(c)
                        || u_charType(c) == U_CONTROL_CHAR
                        || u_charType(c) == U_NON_SPACING_MARK)) {
                     break;
                 }
             }
             trailingwhitespace[numLines] = nextLineStart - q -1;
             p = nextLineStart;
            ++numLines;
         }
         if (numLines >= maxLines) {
             break;
         }
     }
     delete bi;
     return numLines;
 }
 ```

 Most text editors would not break lines based on the number of characters on a
 line. Even with a monospaced font, there are still many Unicode characters that
 are not displayed and therefore should be filtered out of the calculation. With
 a proportional font, character widths are added up until a maximum line width is
 exceeded or an end of the paragraph marker is reached.

 Trailing white space does not need to be counted in the line-width measurement
 because it does not need to be displayed at the end of a line. The sample code
 above returns an array of trailing white space values because an external
 rendering process needs to be able to measure the length of the line (without
 the trailing white space) to justify the lines. For example, if the text is
 right-justified, the invisible white space would be drawn outside the margin.
 The line would actually end with the last visible character.

 In either case, the basic principle is to jump ahead in the text to the location
 where the line would break (without taking word breaks into account). Then, move
 backwards using the preceding() method to find the last legal breaking position
 before that location. Iterating straight through the text with next() method
 will generally be slower.

 ## ICU BreakIterator Data Files

 The source code for the ICU break rules for the standard boundary types is
 located in the directory
 [icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
 These files will be built, and the corresponding binary state tables
 incorporated into ICU's data, by the standard ICU4C build process.

 The dictionary word lists used by word break, and for some languages, line break
 are in
 [icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries).

 The same data is used by both ICU4C and ICU4J. In the normal ICU build process,
 the source data is processed into a binary form using ICU4C, and the resulting
 binary tables are incorporated into ICU4J.
	# Boundary Analysis

	## Overview of Text Boundary Analysis

	Text boundary analysis is the process of locating linguistic boundaries while
	formatting and handling text. Examples of this process include:

	1. Locating appropriate points to word-wrap text to fit within specific margins
	while displaying or printing.

	2. Locating the beginning of a word that the user has selected.

	3. Counting characters, words, sentences, or paragraphs.

	4. Determining how far to move the text cursor when the user hits an arrow key
	(Some characters require more than one position in the text store and some
	characters in the text store do not display at all).

	5. Making a list of the unique words in a document.

	6. Figuring out if a given range of text contains only whole words.

	7. Capitalizing the first letter of each word.

	8. Locating a particular unit of the text (For example, finding the third word
	in the document).

	The `BreakIterator` classes were designed to support these kinds of tasks. The
	BreakIterator objects maintain a location between two characters in the text.
	This location will always be a text boundary. Clients can move the location
	forward to the next boundary or backward to the previous boundary. Clients can
	also check if a particular location within a source text is on a boundary or
	find the boundary which is before or after a particular location.

	## Four Types of BreakIterator

	ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:

	1. Character Boundary

	2. Word Boundary

	3. Line-break Boundary

	4. Sentence Boundary

	Each type of boundary is found in accordance with the rules specified by Unicode
	Standard Annex #29, Unicode Text Segmentation
	(<https://unicode.org/reports/tr29/> ) or Unicode Standard Annex #14, *Unicode
	Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>)

	### Character Boundary

	The character-boundary iterator locates the boundaries according to the rules
	defined in <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>.
	These boundaries try to match what a user would think of as a "character"—a
	basic unit of a writing system for a language—which may be more than just a
	single Unicode code point.

	The letter `Ä`, for example, can be represented in Unicode either with a single
	code-point value or with two code-point values (one representing the `A` and
	another representing the umlaut `¨`). The character-boundary iterator will treat
	either representation as a single character.

	End-user characters, as described above, are also called grapheme clusters, in
	an attempt to limit the confusion caused by multiple meanings for the word
	"character".

	### Word Boundary

	The word-boundary iterator locates the boundaries of words, for purposes such as
	double click selection or "Find whole words" operations.

	Words boundaries are identified according to the rules in
	<https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word
	dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for
	locating word breaks take into account the alphabets and conventions used by
	different languages.

	Here's an example of a sentence, showing the boundary locations that will be
	identified by a word break iterator:

	> :point_right: Note: TODO: An example needs to be added here.

	### Line-break Boundary

	The line-break iterator locates positions that would be appropriate points to
	wrap lines when displaying the text. The boundary rules are define here:
	<https://www.unicode.org/reports/tr14/>

	This example shows the differences in the break locations produced by word and
	line break iterators:

	> :point_right: Note: TODO: An example needs to be added here.

	### Sentence Boundary

	A sentence-break iterator locates sentence boundaries according to the rules
	defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries>

	## Dictionary-Based BreakIterator

	Some languages are written without spaces, and word and line breaking requires
	more than rules over character sequences. ICU provides dictionary support for
	word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

	Use of the dictionaries is automatic when text in one of the dictionary
	languages is encountered. There is no separate API, and no extra programming
	steps required by applications making use of the dictionaries.

	## Usage

	To locate boundaries in a document, create a BreakIterator using the
	`BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()`
	function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`,
	depending on the type of iterator wanted. These factory methods also take a
	parameter that specifies the locale for the language of the text to be processed.

	When creating a `BreakIterator`, a locale is also specified, and the behavior of
	the BreakIterator obtained may be specialized in some way for that locale. For
	most locales the default break iterator behavior is used.

	Applications also may register customized BreakIterators for use in specific
	locales. Once such a break iterator has been registered, any requests for break
	iterators for that locale will return copies of the registered break iterator.

	ICU may cache service instances. Therefore, registration should be done during
	startup, before opening services by locale ID.

	In the general-usage-model, applications will use the following basic steps to
	analyze a piece of text for boundaries:

	1. Create a `BreakIterator` with the desired behavior

	2. Use the `setText()` method to set the iterator to analyze a particular piece
	of text.

	3. Locate the desired boundaries using the appropriate combination of `first()`,
	`last()`, `next()`, `previous()`, `preceding()`, and `following()` methods.

	The `setText()` method can be called more than once, allowing reuse of a
	BreakIterator on new pieces of text. Because the creation of a `BreakIterator` can
	be relatively time-consuming, it makes good sense to reuse them when practical.

	The iterator always points to a boundary position between two characters. The
	numerical value of the position, as returned by `current()` is the zero-based
	index of the character following the boundary. Thus a position of zero
	represents a boundary preceding the first character of the text, and a position
	of one represents a boundary between the first and second characters.

	The `first()` and `last()` methods reset the iterator's current position to the
	beginning or end of the text (the beginning and the end are always considered
	boundaries). The `next()` and `previous()` methods advance the iterator one boundary
	forward or backward from the current position. If the `next()` or `previous()`
	methods run off the beginning or end of the text, it returns DONE. The `current()`
	method returns the current position.

	The `following()` and `preceding()` methods are used for random access, to move the
	iterator to an arbitrary position within the text. Since a BreakIterator always
	points to a boundary position, the `following()` and `preceding()` methods will
	never set the iterator to point to the position specified by the caller (even if
	it is, in fact, a boundary position). `BreakIterator` will, however, set the
	iterator to the nearest boundary position before or after the specified
	position.

	`isBoundary()` returns true if the specified position is a boundary.

	### Thread Safety

	`BreakIterator`s are not thread safe. This is inherit in their design—break
	iterators are stateful, holding a reference to and position in the text, meaning
	that a single instance cannot operate in parallel on multiple texts.

	For concurrent break iteration, each thread must use its own break iterator.
	These can be obtained by creating separate break iterators of the desired type,
	or by initially creating a master break iterator and then creating a clone for
	each thread.

	### Line Breaking Strictness, a CSS Property

	CSS has the concept of "[Line Breaking
	Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
	property specifies the strictness of line-breaking rules applied within an
	element: especially how wrapping interacts with punctuation and symbols. ICU
	line break iterators can choose a strictness using locale tags:

	\| Locale \| Behavior \|
	\| ------------ \| ----------- \|
	\| `en@lb=strict` <br/> `ja@lb=strict` \| Breaks text using the most stringent set of line-breaking rules \|
	\| `en@lb=normal` <br/> `ja@lb=normal` \| Breaks text using the most common set of line-breaking rules. \|
	\| `en@lb=loose` <br/> `ja@lb=loose` \| Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. \|

	### Sentence Break Filters

	Sentence breaking can return false positives - an indication that sentence ends
	in an incorrect position - in the presence of abbreviations. For example,
	consider the sentence

	> In the meantime Mr. Weston arrived with his small ship.

	Default sentence break shows a false boundary following the "Mr."

	ICU includes lists of common abbreviations that can be used to filter, to
	ignore, these false sentence boundaries. Filtering is enabled by the presence of
	the `ss` locale tag when creating the break iterator.

	\| Locale \| Behavior \|
	\| ---------------- \| ------------------------------------------------------- \|
	\| `en` \| no filtering \|
	\| `en@ss=standard` \| Filter based on common English language abbreviations. \|
	\| `es@ss=standard` \| Filter with common Spanish abbreviations. \|

	Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
	French, Italian and Portuguese.

	## Accuracy

	ICU's break iterators are based on the default boundary rules described in the
	Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
	[29](https://www.unicode.org/unicode/reports/tr29/) . These are relatively
	simple boundary rules that can be implemented efficiently, and are sufficient
	for many purposes and languages. However, some languages and applications will
	require a more sophisticated linguistic analysis of the text in order to find
	boundaries with good accuracy. Such an analysis is not directly available from
	ICU at this time.

	Break Iterators based on custom, user-supplied boundary rules can be created and
	used by applications with requirements that are not met by the standard default
	boundary rules.

	## BreakIterator Boundary Analysis Examples

	### Print out all the word-boundary positions in a UnicodeString

	In C++:

	```c++
	void listWordBoundaries(const UnicodeString& s) {
	UErrorCode status = U_ZERO_ERROR;
	BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
	bi->setText(s);
	int32_t p = bi->first();
	while (p != BreakIterator::DONE) {
	printf("Boundary at position %d\n", p);
	p = bi->next();
	}
	delete bi;
	}
	```

	In C:

	```c
	void listWordBoundaries(const UChar* s, int32_t len) {
	UBreakIterator* bi;
	int32_t p;
	UErrorCode err = U_ZERO_ERROR;
	bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
	if (U_FAILURE(err)) return;
	p = ubrk_first(bi);
	while (p != UBRK_DONE) {
	printf("Boundary at position %d\n", p);
	p = ubrk_next(bi);
	}
	ubrk_close(bi);
	}
	```

	### Get the boundaries of the word that contains a double-click position

	In C++:

	```c++
	void wordContaining(BreakIterator& wordBrk,
	int32_t idx,
	const UnicodeString& s,
	int32_t& start,
	int32_t& end) {
	// this function is written to assume that we have an
	// appropriate BreakIterator stored in an object or a
	// global variable somewhere-- When possible, programmers
	// should avoid having the create() and delete calls in
	// a function of this nature.
	if (s.isEmpty())
	return;
	wordBrk.setText(s);
	start = wordBrk.preceding(idx + 1);
	end = wordBrk.next();
	// NOTE: for this and similar operations, use preceding() and next()
	// as shown here, not following() and previous(). preceding() is
	// faster than following() and next() is faster than previous()
	// NOTE: By using preceding(idx + 1) above, we're adopting the convention
	// that if the double-click comes right on top of a word boundary, it
	// selects the word that _begins_ on that boundary (preceding(idx) would
	// instead select the word that _ends_ on that boundary).
	}
	```

	In C:

	```c
	void wordContaining(UBreakIterator* wordBrk,
	int32_t idx,
	const UChar* s,
	int32_t sLen,
	int32_t* start,
	int32_t* end,
	UErrorCode* err) {
	if (wordBrk == NULL \|\| s == NULL \|\| start == NULL \|\| end == NULL) {
	*err = U_ILLEGAL_ARGUMENT_ERROR;
	return;
	}
	ubrk_setText(wordBrk, s, sLen, err);
	if (U_SUCCESS(*err)) {
	*start = ubrk_preceding(wordBrk, idx + 1);
	*end = ubrk_next(wordBrk);
	}
	}
	```

	### Check for Whole Words

	Use the following to check if a range of text is a "whole word":

	In C++:

	```c++
	UBool isWholeWord(BreakIterator& wordBrk,
	const UnicodeString& s,
	int32_t start,
	int32_t end) {
	if (s.isEmpty())
	return FALSE;
	wordBrk.setText(s);
	if (!wordBrk.isBoundary(start))
	return FALSE;
	return wordBrk.isBoundary(end);
	}
	```

	In C:

	```c
	UBool isWholeWord(UBreakIterator* wordBrk,
	const UChar* s,
	int32_t sLen,
	int32_t start,
	int32_t end,
	UErrorCode* err) {
	UBool result = FALSE;
	if (wordBrk == NULL \|\| s == NULL) {
	*err = U_ILLEGAL_ARGUMENT_ERROR;
	return FALSE;
	}
	ubrk_setText(wordBrk, s, sLen, err);
	if (U_SUCCESS(*err)) {
	result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end);
	}
	return result;
	}
	```

	Count the words in a document (C++ only):

	```c++
	int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) {
	bi.setText(s);
	int32_t count = 0;
	while (start != BreakIterator::DONE) {
	int breakType = bi.getRuleStatus();
	if (breakType != UBRK_WORD_NONE) {
	// Exclude spaces, punctuation, and the like.
	// A status value UBRK_WORD_NONE indicates that the boundary does
	// not start a word or number.
	//
	++count;
	}
	start = bi.next();
	}
	return count;
	}
	```

	The function `getRuleStatus()` returns an enum giving additional information on
	the text preceding the last break position found. Using this value, it is
	possible to distinguish between numbers, words, words containing kana
	characters, words containing ideographic characters, and non-word characters,
	such as spaces or punctuation. The sample uses the break status value to filter
	out, and not count, boundaries associated with non-word characters.

	### Word-wrap a document (C++ only)

	The sample function below wraps a paragraph so that each line is less than or
	equal to 72 characters. The function fills in an array passed in by the caller
	with the starting offsets of
	each line in the document. Also, it fills in a second array to track how many
	trailing white space characters there are in the line. For simplicity, it is
	assumed that an outside process has already broken the document into paragraphs.
	For example, it is assumed that every string the function is passed has a single
	newline at the end only.

	```c++
	int32_t wrapParagraph(const UnicodeString& s,
	const Locale& locale,
	int32_t lineStarts[],
	int32_t trailingwhitespace[],
	int32_t maxLines,
	UErrorCode &status) {

	int32_t numLines = 0;
	int32_t p, q;
	const int32_t MAX_CHARS_PER_LINE = 72;
	UChar c;

	BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
	if (U_FAILURE(status)) {
	delete bi;
	return 0;
	}
	bi->setText(s);


	p = 0;
	while (p < s.length()) {
	// jump ahead in the paragraph by the maximum number of
	// characters that will fit
	q = p + MAX_CHARS_PER_LINE;

	// if this puts us on a white space character, a control character
	// (which includes newlines), or a non-spacing mark, seek forward
	// and stop on the next character that is not any of these things
	// since none of these characters will be visible at the end of a
	// line, we can ignore them for the purposes of figuring out how
	// many characters will fit on the line)
	if (q < s.length()) {
	c = s[q];
	while (q < s.length()
	&& (u_isspace(c)
	\|\| u_charType(c) == U_CONTROL_CHAR
	\|\| u_charType(c) == U_NON_SPACING_MARK
	)) {
	++q;
	c = s[q];
	}
	}

	// then locate the last legal line-break decision at or before
	// the current position ("at or before" is what causes the "+ 1")
	q = bi->preceding(q + 1);

	// if this causes us to wind back to where we started, then the
	// line has no legal line-break positions. Break the line at
	// the maximum number of characters
	if (q == p) {
	p += MAX_CHARS_PER_LINE;
	lineStarts[numLines] = p;
	trailingwhitespace[numLines] = 0;
	++numLines;
	}
	// otherwise, we got a good line-break position. Record the start of this
	// line (p) and then seek back from the end of this line (q) until you find
	// a non-white space character (same criteria as above) and
	// record the number of white space characters at the end of the
	// line in the other results array
	else {
	lineStarts[numLines] = p;
	int32_t nextLineStart = q;

	for (q--; q > p; q--) {
	c = s[q];
	if (!(u_isspace(c)
	\|\| u_charType(c) == U_CONTROL_CHAR
	\|\| u_charType(c) == U_NON_SPACING_MARK)) {
	break;
	}
	}
	trailingwhitespace[numLines] = nextLineStart - q -1;
	p = nextLineStart;
	++numLines;
	}
	if (numLines >= maxLines) {
	break;
	}
	}
	delete bi;
	return numLines;
	}
	```

	Most text editors would not break lines based on the number of characters on a
	line. Even with a monospaced font, there are still many Unicode characters that
	are not displayed and therefore should be filtered out of the calculation. With
	a proportional font, character widths are added up until a maximum line width is
	exceeded or an end of the paragraph marker is reached.

	Trailing white space does not need to be counted in the line-width measurement
	because it does not need to be displayed at the end of a line. The sample code
	above returns an array of trailing white space values because an external
	rendering process needs to be able to measure the length of the line (without
	the trailing white space) to justify the lines. For example, if the text is
	right-justified, the invisible white space would be drawn outside the margin.
	The line would actually end with the last visible character.

	In either case, the basic principle is to jump ahead in the text to the location
	where the line would break (without taking word breaks into account). Then, move
	backwards using the preceding() method to find the last legal breaking position
	before that location. Iterating straight through the text with next() method
	will generally be slower.

	## ICU BreakIterator Data Files

	The source code for the ICU break rules for the standard boundary types is
	located in the directory
	[icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
	These files will be built, and the corresponding binary state tables
	incorporated into ICU's data, by the standard ICU4C build process.

	The dictionary word lists used by word break, and for some languages, line break
	are in
	[icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries).

	The same data is used by both ICU4C and ICU4J. In the normal ICU build process,
	the source data is processed into a binary form using ICU4C, and the resulting
	binary tables are incorporated into ICU4J.