| --- |
| layout: default |
| title: UText |
| nav_order: 4 |
| parent: Chars and Strings |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # UText |
| |
| ## Overview |
| |
| UText is a text abstraction facility for ICU. |
| |
| The intent is to make it possible to extend ICU to work with text data that is |
| in formats above and beyond those that are native to ICU. |
| |
| UText directly supports text in these formats: |
| |
| 1. UTF-8 (`char*`) strings |
| 2. UTF-16 (`UChar*` or `UnicodeString`) strings |
| 3. `Replaceable` |
| |
| The ICU services that can accept UText based input are: |
| |
| 1. Regular Expressions |
| 2. Break Iteration |
| |
| Examples of text formats that UText could be extended to support: |
| |
| 1. UTF-32 format. |
| 2. Text that is stored in discontiguous chunks in memory, or in application-specific representations. |
| 3. Text that is in a non-Unicode code page. |
| |
| If ICU does not directly support a desired text format, it is possible for |
| application developers themselves to extend UText, and in that way gain the |
| ability to use their text with ICU. |
| |
| ## Using UText |
| |
| There are three fairly distinct classes of use of UText. These are: |
| |
| 1. **Simple wrapping of existing text.** Application text data exists in a |
| format that is already supported by UText (such as UTF-8). The application |
| opens a UText on the data, and then passes the UText to an ICU service for |
| analysis/processing. Most use of UText from applications will follow this |
| simple pattern. Only a very few UText APIs and only a few lines of code are |
| required. |
| |
| 2. **Accessing the underlying text.** UText provides APIs for iterating over |
| the text in various ways, and for fetching individual code points from the |
| text. These functions will probably be used primarily from within ICU, in |
| the implementation of services that can accept input in the form of a UText. |
| While applications are certainly free to use these text access functions if |
| necessary, there may often be no need. |
| |
| 3. **UText support for new text storage formats.** If an application has text |
| data stored in a format that is not directly supported by ICU, extending |
| UText to support that format will provide the ability to conveniently use |
| those ICU services that support UText. |
| |
| Extending UText to a new format is accomplished by implementing a well |
| defined set of *Text Provider Functions* for that format. |
| |
| ## UText compared with CharacterIterator |
| |
| CharacterIterator is an abstract base class that defines a protocol for |
| accessing characters in a text-storage object. This class has methods for |
| iterating forward and backward over Unicode characters to return either the |
| individual Unicode characters or their corresponding index values. |
| |
| UText and CharacterIterator both provide an abstraction for accessing text while |
| hiding details of the actual storage format. UText is the more flexible of the |
| two, however, with these advantages: |
| |
| 1. UText can conveniently operate on text stored in formats other than UTF-16. |
| 2. UText includes functions for modifying or editing the text. |
| 3. UText is more efficient. When iterating over a range of text using the |
| CharacterIterator API, a function call is required for every character. With |
| UText, iterating to the next character is usually done with a small amount of |
| inline code. |
| |
| At this time, more ICU services support CharacterIterator than UText. ICU |
| services that can operate on text represented by a CharacterIterator are: |
| |
| 1. Normalizer |
| 2. Break Iteration |
| 3. String Search |
| 4. Collation Element Iteration |
| |
| ## Example: Counting the Words in a UTF-8 String |
| |
| Here is a function that uses UText and an ICU break iterator to count the number |
| of words in a nul-terminated UTF-8 string. The use of UText only adds two lines |
| of code over what a similar function operating on normal UTF-16 strings would |
| require. |
| |
| ```c |
| #include <assert.h> |
| |
| #include "unicode/utypes.h" |
| #include "unicode/ubrk.h" |
| #include "unicode/utext.h" |
| |
| int countWords(const char *utf8String) { |
|     UText *ut = NULL; |
|     UBreakIterator *bi = NULL; |
|     int wordCount = 0; |
|     UErrorCode status = U_ZERO_ERROR; |
| |
|     /* Wrap the UTF-8 string in a UText (heap allocated, since ut is NULL). */ |
|     ut = utext_openUTF8(ut, utf8String, -1, &status); |
|     bi = ubrk_open(UBRK_WORD, "en_us", NULL, 0, &status); |
| |
|     /* Point the break iterator at the UText instead of a UTF-16 string. */ |
|     ubrk_setUText(bi, ut, &status); |
|     while (ubrk_next(bi) != UBRK_DONE) { |
|         if (ubrk_getRuleStatus(bi) != UBRK_WORD_NONE) { |
|             /* Count only words and numbers, not spaces or punctuation */ |
|             wordCount++; |
|         } |
|     } |
|     utext_close(ut); |
|     ubrk_close(bi); |
|     assert(U_SUCCESS(status)); |
|     return wordCount; |
| } |
| ``` |
| |
| ## UText API Functions |
| |
| The UText API is declared in the ICU header file |
| [utext.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utext.h). |
| |
| ### Opening and Closing |
| |
| Normal usage of UText by an application consists of opening a UText to wrap some |
| existing text, then passing the UText to ICU functions for processing. For this |
| kind of usage, all that is needed is the appropriate UText open and close |
| functions. |
| |
| | Function | Description | |
| |--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------| |
| `utext_openUChars` | Open a UText over a standard ICU (`UChar *`) string. The string consists of a UTF-16 array in memory, either nul-terminated or with an explicit length. |
| `utext_openUnicodeString` | Open a UText over an instance of an ICU C++ `UnicodeString`. |
| `utext_openConstUnicodeString` | Open a UText over a read-only `UnicodeString`. Disallows UText APIs that modify the text. |
| `utext_openReplaceable` | Open a UText over an instance of an ICU C++ `Replaceable`. |
| `utext_openUTF8` | Open a UText over a UTF-8 encoded C string. May be either nul-terminated or have an explicit length. |
| | `utext_close` | Close an open UText. Frees any allocated memory; required to prevent memory leaks. | |
| |
| Here are some suggestions and techniques for efficient use of UText. |
| |
| #### Minimizing Heap Usage |
| |
| UText's open functions include features to allow applications to minimize the |
| number of heap memory allocations that will be needed. Specifically, |
| |
| 1. UText structs may be declared as local variables, that is, they may be stack |
| allocated rather than heap allocated. |
| 2. Existing UText structs may be reused to refer to new text, avoiding the need |
| to allocate and initialize a new UText instance. |
| |
| Minimizing heap allocations is important in code that has critical performance |
| requirements, and is doubly important for code that must scale well in |
| multithreaded, multiprocessor environments. |
| |
| #### Stack Allocation |
| |
| Here is code for stack-allocating a UText: |
| |
| ```c |
| UText myText = UTEXT_INITIALIZER; |
| utext_openUChars(&myText, ... |
| ``` |
| |
| The first parameter to all `utext_open` functions is a pointer to a UText. If it |
| is non-null, the supplied UText will be used; if it is null, a new UText will be |
| heap allocated. |
| |
| Stack allocated UText objects *must* be initialized with `UTEXT_INITIALIZER`. An |
| uninitialized instance will fail to open. |
| |
| #### Heap Allocation |
| |
| Here is code for creating a heap allocated UText: |
| |
| ```c |
| UText *mytext = utext_openUChars(NULL, ... |
| ``` |
| |
| This is slightly smaller and more convenient to write than the stack allocated |
| code, and there is no reason not to use heap allocated UText objects in the vast |
| majority of code that does not have extreme performance constraints. |
| |
| #### Reuse |
| |
| To reuse an existing UText, simply pass it as the first parameter to any of the |
| UText open functions. There is no need to close the UText first, and it may |
| actually be more efficient not to close it first. |
| |
| Here is an example of a function that iterates over an array of UTF-8 strings, |
| wrapping each in a UText and passing it off to another function. On the first |
| time through the loop the utext open function will heap allocate a UText. On |
| each subsequent iteration the existing UText will be reused. |
| |
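| ```c |
| #include <assert.h> |
| |
| #include "unicode/utypes.h" |
| #include "unicode/utext.h" |
| |
| /* Application-defined function that consumes the UText. */ |
| void do_something(UText *ut); |
| |
| void f(char **strings, int numStrings) { |
|     UText *ut = NULL; |
|     UErrorCode status; |
| |
|     int i; |
|     for (i = 0; i < numStrings; i++) { |
|         status = U_ZERO_ERROR; |
|         /* First pass: ut is NULL, so a UText is heap allocated. */ |
|         /* Later passes: the existing UText is reused. */ |
|         ut = utext_openUTF8(ut, strings[i], -1, &status); |
|         assert(U_SUCCESS(status)); |
|         do_something(ut); |
|     } |
|     utext_close(ut); |
| } |
| ``` |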
| |
| #### Close |
| |
| Closing a UText with `utext_close()` frees any storage associated with it, including the UText itself |
| for those that are heap allocated. Stack allocated UTexts should also be closed |
| because in some cases there may be additional heap allocated storage associated |
| with them, depending on the type of the underlying text storage. |
| |
| ## Accessing the Text |
| |
| For accessing the underlying text, UText provides functions both for iterating |
| over the characters, and for direct random access by index. Here are the |
| conventions that apply for all of the access functions: |
| |
| 1. Access to individual characters is always by code points, that is, 32-bit |
| Unicode values are always returned. UTF-16 surrogate values from a surrogate |
| pair, like bytes from a UTF-8 sequence, are not separately visible. |
| 2. Indexing always uses the index values from the original underlying text |
| storage, in whatever form it has. If the underlying storage is UTF-8, the |
| indexes will be UTF-8 byte indexes, not UTF-16 offsets. |
| 3. Indexes always refer to the first position of a character. This is |
| equivalent to saying that indexes always lie at the boundary between |
| characters. If an index supplied to a UText function refers to the 2<sup>nd</sup> |
| through the N<sup>th</sup> positions of a multi-byte or multi-code-unit character, the |
| index will be normalized back to the first or lowest index. |
| 4. An input index that is greater than the length of the text will be set to |
| refer to the end of the string, and will not generate an out-of-bounds error. |
| This is similar to the indexing behavior in the UnicodeString class. |
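| 5. Iteration uses post-increment and pre-decrement conventions. That is, |
| `utext_next32()` fetches the code point at the current index, then leaves the |
| index pointing at the next character; `utext_previous32()` first moves the |
| index back by one character, then fetches the code point at that position. |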
| |
| Here are the functions for accessing the actual text data represented by a |
| UText. The primary use of these functions will be in the implementation of ICU |
| services that accept input in the form of a UText, although application code may |
| also use them if the need arises. |
| |
| For more detailed descriptions of each, see the API reference. |
| |
| | Function | Description | |
| |-------------------------|------------------------------------------------------------------------------------------------------------| |
| `utext_nativeLength` | Get the length of the text string in terms of the underlying native storage – bytes for UTF-8, for example. |
| | `utext_isLengthExpensive` | Indicate whether determining the length of the string would require scanning the string. | |
| | `utext_char32At` | Get the code point at the specified index. | |
| | `utext_current32` | Get the code point at the current iteration position. Does not advance the position. | |
| | `utext_next32` | Get the next code point, iterating forwards. | |
| | `utext_previous32` | Get the previous code point, iterating backwards. | |
| | `utext_next32From` | Begin a forwards iteration at a specified index. | |
| | `utext_previous32From` | Begin a reverse iteration at a specified index. | |
| | `utext_getNativeIndex` | Get the current iteration index. | |
| | `utext_setNativeIndex` | Set the iteration index. | |
| | `utext_moveIndex32` | Move the current index forwards or backwards by the specified number of code points. | |
| | `utext_extract` | Retrieve a range of text, placing it into a UTF-16 buffer. | |
| `UTEXT_NEXT32` | Inline (high performance) version of `utext_next32`. |
| `UTEXT_PREVIOUS32` | Inline (high performance) version of `utext_previous32`. |
| |
| ## Modifying the Text |
| |
| UText provides API functions for modifying or editing the text. |
| |
| | Function | Description | |
| |---------------------|----------------------------------------------------------------------------------------------------| |
| | `utext_replace` | Replace a range of the original text with a replacement string. | |
| `utext_copy` | Copy or move a range of the text to a new position. |
| `utext_isWritable` | Test whether a UText supports writing operations. |
| `utext_hasMetaData` | Test whether the text includes metadata. See the class `Replaceable` for more information on metadata. |
| |
| Certain conventions must be followed when modifying text using these functions: |
| |
| 1. Not all types of UText can support modifying the data. Code working with |
| UText instances of unknown origin should check `utext_isWritable()` first, and |
| be prepared to deal with failures. |
| 2. There must be only one UText open onto the underlying string that is being |
| modified. (Strings that are not being modified can be the target of any |
| number of UTexts at the same time.) The existence of a second UText that |
| refers to a string that is being modified is not a situation that is |
| detected by the implementation. The application code must be structured to |
| avoid the situation. |
| |
| #### Cloning |
| |
| UText instances may be cloned. The clone function, |
| |
| ```c |
| UText * utext_clone(UText *dest, |
| const UText *src, |
| UBool deep, |
| UBool readOnly, |
| UErrorCode *status) |
| ``` |
| |
| behaves very much like the UText open functions, with the source of the text being |
| another UText rather than some other form of a string. |
| |
| A *shallow* clone creates a new UText that maintains its own iteration state, |
| but does not clone the underlying text itself. |
| |
| A *deep* clone copies the underlying text in addition to the UText state. This |
| would be appropriate if you wished to modify the text without the changes being |
| reflected back to the original source string. Not all text providers support |
| deep clone, so checking for error status returns from `utext_clone()` is |
| important. |
| |
| #### Thread Safety |
| |
| UText follows the usual ICU conventions for thread safety: concurrent calls to |
| functions accessing the same non-const UText are not supported. If concurrent |
| access to the text is required, the UText can be cloned, allowing each thread |
| access via a separate UText. So long as the underlying text is not being |
| modified, a shallow clone is sufficient. |
| |
| ## Text Providers |
| |
| A *text provider* is a set of functions that let UText support a specific text |
| storage format. |
| |
| ICU includes several UText text provider implementations, and applications can |
| provide additional ones if needed. |
| |
| To implement a new UText text provider, it is necessary to have an understanding |
| of how UText is designed. |
| |
| Underneath the covers, UText is a struct that includes: |
| |
| 1. A pointer to a *Text Chunk*, which is a UTF-16 buffer containing a section |
| (or all) of the text being referenced. |
| |
| For text sources whose native format |
| is UTF-16, the chunk description can refer directly to the original text |
| data. For non-UTF-16 sources, the chunk will refer to a side buffer |
| containing some range of the text that has been converted to UTF-16 format. |
| 2. The iteration position, as a UTF-16 offset within the chunk. |
| |
| If a text access function (one of those described in the previous section) can |
| do its work using only the information maintained in the UText struct, it |
| will. If not, it will call out to one of the provider functions (below) to do |
| the work, or to update the UText. |
| |
| The best way to really understand what is required of a UText provider is to |
| study the implementations that are included with ICU, and to borrow as much as |
| possible. |
| |
| Here is the list of text provider functions. |
| |
| | Function | Description | |
| |----------------------------|----------------------------------------------------------------------------------------------------| |
| | `UTextAccess` | Set up the Text Chunk associated with this UText so that it includes a requested index position. | |
| | `UTextNativeLength` | Return the full length of the text. | |
| | `UTextClone` | Clone the UText. | |
| `UTextExtract` | Extract a range of text into a caller-supplied buffer. |
| | `UTextReplace` | Replace a range of text with a caller-supplied replacement. May expand or shrink the overall text. | |
| | `UTextCopy` | Move or copy a range of text to a new position. | |
| | `UTextMapOffsetToNative` | Within the current text chunk, translate a UTF-16 buffer offset to an absolute native index. | |
| | `UTextMapNativeIndexToUTF16` | Translate an absolute native index to a UTF-16 buffer offset within the current text. | |
| | `UTextClose` | Provider specific close. Free storage as required. | |
| |
| Not every provider type requires all of the functions. If the text type is |
| read-only, no implementation for Replace or Copy is required. If the text is in |
| UTF-16 format, no implementation of the native to UTF-16 index conversions is |
| required. |
| |
| To fully understand what is required to support a new string type with UText, it |
| will be necessary to study both the provider function declarations from |
| [utext.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utext.h) |
| and the existing text provider implementations in |
| [utext.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/utext.cpp). |