| # Boundary Analysis |
| |
| ## Overview of Text Boundary Analysis |
| |
| Text boundary analysis is the process of locating linguistic boundaries while |
| formatting and handling text. Examples of this process include: |
| |
| 1. Locating appropriate points to word-wrap text to fit within specific margins |
| while displaying or printing. |
| |
| 2. Locating the beginning of a word that the user has selected. |
| |
| 3. Counting characters, words, sentences, or paragraphs. |
| |
| 4. Determining how far to move the text cursor when the user hits an arrow key |
| (Some characters require more than one position in the text store and some |
| characters in the text store do not display at all). |
| |
| 5. Making a list of the unique words in a document. |
| |
| 6. Figuring out if a given range of text contains only whole words. |
| |
| 7. Capitalizing the first letter of each word. |
| |
| 8. Locating a particular unit of the text (For example, finding the third word |
| in the document). |
| |
| The `BreakIterator` classes were designed to support these kinds of tasks. The |
| BreakIterator objects maintain a location between two characters in the text. |
| This location will always be a text boundary. Clients can move the location |
| forward to the next boundary or backward to the previous boundary. Clients can |
| also check if a particular location within a source text is on a boundary or |
| find the boundary which is before or after a particular location. |
| |
| ## Four Types of BreakIterator |
| |
| ICU `BreakIterator`s can be used to locate the following kinds of text boundaries: |
| |
| 1. Character Boundary |
| |
| 2. Word Boundary |
| |
| 3. Line-break Boundary |
| |
| 4. Sentence Boundary |
| |
| Each type of boundary is found in accordance with the rules specified by Unicode |
| Standard Annex #29, *Unicode Text Segmentation* |
| (<https://unicode.org/reports/tr29/> ) or Unicode Standard Annex #14, *Unicode |
| Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>) |
| |
| ### Character Boundary |
| |
| The character-boundary iterator locates the boundaries according to the rules |
| defined in <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>. |
| These boundaries try to match what a user would think of as a "character"—a |
| basic unit of a writing system for a language—which may be more than just a |
| single Unicode code point. |
| |
| The letter `Ä`, for example, can be represented in Unicode either with a single |
| code-point value or with two code-point values (one representing the `A` and |
| another representing the umlaut `¨`). The character-boundary iterator will treat |
| either representation as a single character. |
| |
| End-user characters, as described above, are also called grapheme clusters, in |
| an attempt to limit the confusion caused by multiple meanings for the word |
| "character". |
| |
| ### Word Boundary |
| |
| The word-boundary iterator locates the boundaries of words, for purposes such as |
| double click selection or "Find whole words" operations. |
| |
| Words boundaries are identified according to the rules in |
| <https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word |
| dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for |
| locating word breaks take into account the alphabets and conventions used by |
| different languages. |
| |
| Here's an example of a sentence, showing the boundary locations that will be |
| identified by a word break iterator: |
| |
| > :point_right: **Note**: TODO: An example needs to be added here. |
| |
| ### Line-break Boundary |
| |
| The line-break iterator locates positions that would be appropriate points to |
| wrap lines when displaying the text. The boundary rules are define here: |
| <https://www.unicode.org/reports/tr14/> |
| |
| This example shows the differences in the break locations produced by word and |
| line break iterators: |
| |
| > :point_right: **Note**: TODO: An example needs to be added here. |
| |
| ### Sentence Boundary |
| |
| A sentence-break iterator locates sentence boundaries according to the rules |
| defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries> |
| |
| ## Dictionary-Based BreakIterator |
| |
| Some languages are written without spaces, and word and line breaking requires |
| more than rules over character sequences. ICU provides dictionary support for |
| word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese. |
| |
| Use of the dictionaries is automatic when text in one of the dictionary |
| languages is encountered. There is no separate API, and no extra programming |
| steps required by applications making use of the dictionaries. |
| |
| ## Usage |
| |
| To locate boundaries in a document, create a BreakIterator using the |
| `BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()` |
| function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`, |
| depending on the type of iterator wanted. These factory methods also take a |
| parameter that specifies the locale for the language of the text to be processed. |
| |
| When creating a `BreakIterator`, a locale is also specified, and the behavior of |
| the BreakIterator obtained may be specialized in some way for that locale. For |
| most locales the default break iterator behavior is used. |
| |
| Applications also may register customized BreakIterators for use in specific |
| locales. Once such a break iterator has been registered, any requests for break |
| iterators for that locale will return copies of the registered break iterator. |
| |
| ICU may cache service instances. Therefore, registration should be done during |
| startup, before opening services by locale ID. |
| |
| In the general-usage-model, applications will use the following basic steps to |
| analyze a piece of text for boundaries: |
| |
| 1. Create a `BreakIterator` with the desired behavior |
| |
| 2. Use the `setText()` method to set the iterator to analyze a particular piece |
| of text. |
| |
| 3. Locate the desired boundaries using the appropriate combination of `first()`, |
| `last()`, `next()`, `previous()`, `preceding()`, and `following()` methods. |
| |
| The `setText()` method can be called more than once, allowing reuse of a |
| BreakIterator on new pieces of text. Because the creation of a `BreakIterator` can |
| be relatively time-consuming, it makes good sense to reuse them when practical. |
| |
| The iterator always points to a boundary position between two characters. The |
| numerical value of the position, as returned by `current()` is the zero-based |
| index of the character following the boundary. Thus a position of zero |
| represents a boundary preceding the first character of the text, and a position |
| of one represents a boundary between the first and second characters. |
| |
| The `first()` and `last()` methods reset the iterator's current position to the |
| beginning or end of the text (the beginning and the end are always considered |
| boundaries). The `next()` and `previous()` methods advance the iterator one boundary |
| forward or backward from the current position. If the `next()` or `previous()` |
| methods run off the beginning or end of the text, it returns DONE. The `current()` |
| method returns the current position. |
| |
| The `following()` and `preceding()` methods are used for random access, to move the |
| iterator to an arbitrary position within the text. Since a BreakIterator always |
| points to a boundary position, the `following()` and `preceding()` methods will |
| never set the iterator to point to the position specified by the caller (even if |
| it is, in fact, a boundary position). `BreakIterator` will, however, set the |
| iterator to the nearest boundary position before or after the specified |
| position. |
| |
| `isBoundary()` returns true if the specified position is a boundary. |
| |
| ### Thread Safety |
| |
| `BreakIterator`s are not thread safe. This is inherit in their design—break |
| iterators are stateful, holding a reference to and position in the text, meaning |
| that a single instance cannot operate in parallel on multiple texts. |
| |
| For concurrent break iteration, each thread must use its own break iterator. |
| These can be obtained by creating separate break iterators of the desired type, |
| or by initially creating a master break iterator and then creating a clone for |
| each thread. |
| |
| ### Line Breaking Strictness, a CSS Property |
| |
| CSS has the concept of "[Line Breaking |
| Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This |
| property specifies the strictness of line-breaking rules applied within an |
| element: especially how wrapping interacts with punctuation and symbols. ICU |
| line break iterators can choose a strictness using locale tags: |
| |
| | Locale | Behavior | |
| | ------------ | ----------- | |
| | `en@lb=strict` <br/> `ja@lb=strict` | Breaks text using the most stringent set of line-breaking rules | |
| | `en@lb=normal` <br/> `ja@lb=normal` | Breaks text using the most common set of line-breaking rules. | |
| | `en@lb=loose` <br/> `ja@lb=loose` | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. | |
| |
| ### Sentence Break Filters |
| |
| Sentence breaking can return false positives - an indication that sentence ends |
| in an incorrect position - in the presence of abbreviations. For example, |
| consider the sentence |
| |
| > In the meantime Mr. Weston arrived with his small ship. |
| |
| Default sentence break shows a false boundary following the "Mr." |
| |
| ICU includes lists of common abbreviations that can be used to filter, to |
| ignore, these false sentence boundaries. Filtering is enabled by the presence of |
| the `ss` locale tag when creating the break iterator. |
| |
| | Locale | Behavior | |
| | ---------------- | ------------------------------------------------------- | |
| | `en` | no filtering | |
| | `en@ss=standard` | Filter based on common English language abbreviations. | |
| | `es@ss=standard` | Filter with common Spanish abbreviations. | |
| |
| Abbreviation lists are available (as of ICU 64) for English, German, Spanish, |
| French, Italian and Portuguese. |
| |
| ## Accuracy |
| |
| ICU's break iterators are based on the default boundary rules described in the |
| Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and |
| [29](https://www.unicode.org/unicode/reports/tr29/) . These are relatively |
| simple boundary rules that can be implemented efficiently, and are sufficient |
| for many purposes and languages. However, some languages and applications will |
| require a more sophisticated linguistic analysis of the text in order to find |
| boundaries with good accuracy. Such an analysis is not directly available from |
| ICU at this time. |
| |
| Break Iterators based on custom, user-supplied boundary rules can be created and |
| used by applications with requirements that are not met by the standard default |
| boundary rules. |
| |
| ## BreakIterator Boundary Analysis Examples |
| |
| ### Print out all the word-boundary positions in a UnicodeString |
| |
| **In C++:** |
| |
| ```c++ |
| void listWordBoundaries(const UnicodeString& s) { |
| UErrorCode status = U_ZERO_ERROR; |
| BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status); |
| bi->setText(s); |
| int32_t p = bi->first(); |
| while (p != BreakIterator::DONE) { |
| printf("Boundary at position %d\n", p); |
| p = bi->next(); |
| } |
| delete bi; |
| } |
| ``` |
| |
| **In C:** |
| |
| ```c |
| void listWordBoundaries(const UChar* s, int32_t len) { |
| UBreakIterator* bi; |
| int32_t p; |
| UErrorCode err = U_ZERO_ERROR; |
| bi = ubrk_open(UBRK_WORD, 0, s, len, &err); |
| if (U_FAILURE(err)) return; |
| p = ubrk_first(bi); |
| while (p != UBRK_DONE) { |
| printf("Boundary at position %d\n", p); |
| p = ubrk_next(bi); |
| } |
| ubrk_close(bi); |
| } |
| ``` |
| |
| ### Get the boundaries of the word that contains a double-click position |
| |
| **In C++:** |
| |
| ```c++ |
| void wordContaining(BreakIterator& wordBrk, |
| int32_t idx, |
| const UnicodeString& s, |
| int32_t& start, |
| int32_t& end) { |
| // this function is written to assume that we have an |
| // appropriate BreakIterator stored in an object or a |
| // global variable somewhere-- When possible, programmers |
| // should avoid having the create() and delete calls in |
| // a function of this nature. |
| if (s.isEmpty()) |
| return; |
| wordBrk.setText(s); |
| start = wordBrk.preceding(idx + 1); |
| end = wordBrk.next(); |
| // NOTE: for this and similar operations, use preceding() and next() |
| // as shown here, not following() and previous(). preceding() is |
| // faster than following() and next() is faster than previous() |
| // NOTE: By using preceding(idx + 1) above, we're adopting the convention |
| // that if the double-click comes right on top of a word boundary, it |
| // selects the word that _begins_ on that boundary (preceding(idx) would |
| // instead select the word that _ends_ on that boundary). |
| } |
| ``` |
| |
| **In C:** |
| |
| ```c |
| void wordContaining(UBreakIterator* wordBrk, |
| int32_t idx, |
| const UChar* s, |
| int32_t sLen, |
| int32_t* start, |
| int32_t* end, |
| UErrorCode* err) { |
| if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) { |
| *err = U_ILLEGAL_ARGUMENT_ERROR; |
| return; |
| } |
| ubrk_setText(wordBrk, s, sLen, err); |
| if (U_SUCCESS(*err)) { |
| *start = ubrk_preceding(wordBrk, idx + 1); |
| *end = ubrk_next(wordBrk); |
| } |
| } |
| ``` |
| |
| ### Check for Whole Words |
| |
| Use the following to check if a range of text is a "whole word": |
| |
| **In C++:** |
| |
| ```c++ |
| UBool isWholeWord(BreakIterator& wordBrk, |
| const UnicodeString& s, |
| int32_t start, |
| int32_t end) { |
| if (s.isEmpty()) |
| return FALSE; |
| wordBrk.setText(s); |
| if (!wordBrk.isBoundary(start)) |
| return FALSE; |
| return wordBrk.isBoundary(end); |
| } |
| ``` |
| |
| **In C:** |
| |
| ```c |
| UBool isWholeWord(UBreakIterator* wordBrk, |
| const UChar* s, |
| int32_t sLen, |
| int32_t start, |
| int32_t end, |
| UErrorCode* err) { |
| UBool result = FALSE; |
| if (wordBrk == NULL || s == NULL) { |
| *err = U_ILLEGAL_ARGUMENT_ERROR; |
| return FALSE; |
| } |
| ubrk_setText(wordBrk, s, sLen, err); |
| if (U_SUCCESS(*err)) { |
| result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end); |
| } |
| return result; |
| } |
| ``` |
| |
| Count the words in a document (C++ only): |
| |
| ```c++ |
| int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) { |
| bi.setText(s); |
| int32_t count = 0; |
| while (start != BreakIterator::DONE) { |
| int breakType = bi.getRuleStatus(); |
| if (breakType != UBRK_WORD_NONE) { |
| // Exclude spaces, punctuation, and the like. |
| // A status value UBRK_WORD_NONE indicates that the boundary does |
| // not start a word or number. |
| // |
| ++count; |
| } |
| start = bi.next(); |
| } |
| return count; |
| } |
| ``` |
| |
| The function `getRuleStatus()` returns an enum giving additional information on |
| the text preceding the last break position found. Using this value, it is |
| possible to distinguish between numbers, words, words containing kana |
| characters, words containing ideographic characters, and non-word characters, |
| such as spaces or punctuation. The sample uses the break status value to filter |
| out, and not count, boundaries associated with non-word characters. |
| |
| ### Word-wrap a document (C++ only) |
| |
| The sample function below wraps a paragraph so that each line is less than or |
| equal to 72 characters. The function fills in an array passed in by the caller |
| with the starting offsets of |
| each line in the document. Also, it fills in a second array to track how many |
| trailing white space characters there are in the line. For simplicity, it is |
| assumed that an outside process has already broken the document into paragraphs. |
| For example, it is assumed that every string the function is passed has a single |
| newline at the end only. |
| |
| ```c++ |
| int32_t wrapParagraph(const UnicodeString& s, |
| const Locale& locale, |
| int32_t lineStarts[], |
| int32_t trailingwhitespace[], |
| int32_t maxLines, |
| UErrorCode &status) { |
| |
| int32_t numLines = 0; |
| int32_t p, q; |
| const int32_t MAX_CHARS_PER_LINE = 72; |
| UChar c; |
| |
| BreakIterator *bi = BreakIterator::createLineInstance(locale, status); |
| if (U_FAILURE(status)) { |
| delete bi; |
| return 0; |
| } |
| bi->setText(s); |
| |
| |
| p = 0; |
| while (p < s.length()) { |
| // jump ahead in the paragraph by the maximum number of |
| // characters that will fit |
| q = p + MAX_CHARS_PER_LINE; |
| |
| // if this puts us on a white space character, a control character |
| // (which includes newlines), or a non-spacing mark, seek forward |
| // and stop on the next character that is not any of these things |
| // since none of these characters will be visible at the end of a |
| // line, we can ignore them for the purposes of figuring out how |
| // many characters will fit on the line) |
| if (q < s.length()) { |
| c = s[q]; |
| while (q < s.length() |
| && (u_isspace(c) |
| || u_charType(c) == U_CONTROL_CHAR |
| || u_charType(c) == U_NON_SPACING_MARK |
| )) { |
| ++q; |
| c = s[q]; |
| } |
| } |
| |
| // then locate the last legal line-break decision at or before |
| // the current position ("at or before" is what causes the "+ 1") |
| q = bi->preceding(q + 1); |
| |
| // if this causes us to wind back to where we started, then the |
| // line has no legal line-break positions. Break the line at |
| // the maximum number of characters |
| if (q == p) { |
| p += MAX_CHARS_PER_LINE; |
| lineStarts[numLines] = p; |
| trailingwhitespace[numLines] = 0; |
| ++numLines; |
| } |
| // otherwise, we got a good line-break position. Record the start of this |
| // line (p) and then seek back from the end of this line (q) until you find |
| // a non-white space character (same criteria as above) and |
| // record the number of white space characters at the end of the |
| // line in the other results array |
| else { |
| lineStarts[numLines] = p; |
| int32_t nextLineStart = q; |
| |
| for (q--; q > p; q--) { |
| c = s[q]; |
| if (!(u_isspace(c) |
| || u_charType(c) == U_CONTROL_CHAR |
| || u_charType(c) == U_NON_SPACING_MARK)) { |
| break; |
| } |
| } |
| trailingwhitespace[numLines] = nextLineStart - q -1; |
| p = nextLineStart; |
| ++numLines; |
| } |
| if (numLines >= maxLines) { |
| break; |
| } |
| } |
| delete bi; |
| return numLines; |
| } |
| ``` |
| |
| Most text editors would not break lines based on the number of characters on a |
| line. Even with a monospaced font, there are still many Unicode characters that |
| are not displayed and therefore should be filtered out of the calculation. With |
| a proportional font, character widths are added up until a maximum line width is |
| exceeded or an end of the paragraph marker is reached. |
| |
| Trailing white space does not need to be counted in the line-width measurement |
| because it does not need to be displayed at the end of a line. The sample code |
| above returns an array of trailing white space values because an external |
| rendering process needs to be able to measure the length of the line (without |
| the trailing white space) to justify the lines. For example, if the text is |
| right-justified, the invisible white space would be drawn outside the margin. |
| The line would actually end with the last visible character. |
| |
| In either case, the basic principle is to jump ahead in the text to the location |
| where the line would break (without taking word breaks into account). Then, move |
| backwards using the preceding() method to find the last legal breaking position |
| before that location. Iterating straight through the text with next() method |
| will generally be slower. |
| |
| ## ICU BreakIterator Data Files |
| |
| The source code for the ICU break rules for the standard boundary types is |
| located in the directory |
| [icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules). |
| These files will be built, and the corresponding binary state tables |
| incorporated into ICU's data, by the standard ICU4C build process. |
| |
| The dictionary word lists used by word break, and for some languages, line break |
| are in |
| [icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries). |
| |
| The same data is used by both ICU4C and ICU4J. In the normal ICU build process, |
| the source data is processed into a binary form using ICU4C, and the resulting |
| binary tables are incorporated into ICU4J. |