blob: 3d9d1a7bd4ef96921b3270e445d9e7c7500a84c5 [file] [log] [blame] [view]
---
layout: default
title: CharacterIterator
nav_order: 3
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# CharacterIterator Class
## Overview
CharacterIterator is the abstract base class that defines a protocol for
accessing characters in a text-storage object. This class has methods for
iterating forward and backward over Unicode characters to return either the
individual Unicode characters or their corresponding index values.
Using CharacterIterator ICU iterates over text that is independent of its
storage method. The text can be stored locally or remotely in a string, file,
database, or other method. The CharacterIterator methods make the text appear as
if it is local.
The CharacterIterator keeps track of its current position and index in the text
and can do the following
1. Move forward or backward one Unicode character at a time
2. Jump to a new location using absolute or relative positioning
3. Move to the beginning or end of its range
4. Return a character or the index to a character
The information can be restricted to a sub-range of characters, can contain a
large block of text that can be iterated as a whole, or can be broken into
smaller blocks for the purpose of iteration.
> :point_right: **Note**: *CharacterIterator is different from
[Normalizer](../transforms/normalization/index) in that CharacterIterator
walks through the Unicode characters without interpretation.*
Prior to ICU release 1.6, the CharacterIterator class allowed access to a single
UChar at a time and did not support variable-width encoding. Single UChar
support makes it difficult when supplementary support is expected in UTF16
encodings. Beginning with ICU release 1.6, the CharacterIterator class now
efficiently supports UTF-16 encodings and provides new APIs for UTF32 return
values. The API names for the UTF16 and UTF32 encodings differ because the UTF32
APIs include "32" within their naming structure. For example,
CharacterIterator::current() returns the code unit and Character::current32()
returns a code point.
## Base class inherited by CharacterIterator
The class,
[ForwardCharacterIterator,](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classForwardCharacterIterator.html)
is a superclass of the CharacterIterator class. This superclass provides methods
for forward iteration only for both UTF16 and UTF32 access, and is and based on
a efficient forward iteration mechanism. In some situations, where you need to
iterate over text that does not allow random-access, the
ForwardCharacterIterator superclass is the most efficient method. For example,
iterate a UChar string using a character converter with the [ucnv_getNextUChar()
function.](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html)
## Subclasses of CharacterIterator provided by ICU
ICU provides the following concrete subclasses of the CharacterIteratorclass:
1. [UCharCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUCharCharacterIterator.html)
subclass iterates over a `UChar[]` array.
2. [StringCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classStringCharacterIterator.html)
subclass extends from `UCharCharacterIterator` and iterates over the contents
of a `UnicodeString`.
## Usage
To use the methods specified in CharacterIterator class, do one of the
following:
1. Make a subclass that inherits from the CharacterIterator class
2. Use the StringCharacterIterator subclass
3. Use the UCharCharacterIterator subclass
CharacterIterator objects keep track of its current position within the text
that is iterated over. The CharacterIterator class uses an object similar to a
cursor that gets initialized to the beginning of the text and advances according
to the operations that are used on the object. The current index can move
between two positions (a start and a limit) that are set with the text. The
limit position is one character greater than the position of the last UChar
character that is used.
### Forward iteration
For efficiency, ICU can iterate over text using post-increment semantics or
Forward Iteration. Forward Iteration is an access method that reads a character
from the current index position and moves the index forward. It leaves the index
behind the character it read and returns the character read. ICU can use
nextPostInc() or next32PostInc() calls with hasNext() to perform Forward
Iteration. These calls are the only character access methods provided by the
ForwardCharacterIterator. An iteration loop can be started with the
setToStart(), firstPostInc() or first32PostInc()calls . (The setToStart() call
is implied after instantiating the iterator or setting the text.)
The less efficient forward iteration mechanism that is available for
compatibility with Java provides pre-increment semantics. With these methods,
the current character is skipped, and then the following character is read and
returned. This is a less efficient method for a variable-width encoding because
the width of each character is determined twice; once to read it and once to
skip it the next time ICU calls the method. The methods used for Forward
Iteration are the next() or next32() calls. An iteration loop must start with
first() or first32() calls to get the first character.
### Backward iteration
Backward Iteration has pre-decrement semantics, which are the exact opposite of
the post-increment Forward Iteration. The current index reads the character that
precedes the index, the character is returned, and the index is left at the
beginning of this character. The methods used for Backward Iteration are the
previous() or previous32() calls with the hasPrevious() call . An iteration loop
can be started with setToEnd(), last(), or last32() calls.
### Direct index manipulation
The index can be set and moved directly without iteration to start iterating at
an arbitrary position, skip some characters, or reset the index to an earlier
position. It is possible to set the index to one after the last text code unit
for backward iteration.
The setIndex() and setIndex32() calls set the index to a new position and return
the character at that new position. The setIndex32() call ensures that the new
position is at the beginning of the character (on its first code unit). Since
the character at the new position is returned, these functions can be used for
both pre-increment and post-increment iteration semantics.
Similarly, the current() and current32() calls return the character at the
current index without modifying the index. The current32() call retrieves the
complete character whether the index is on the first code unit or not.
The index and the iteration boundaries can be retrieved using separate
functions. The following syntax is used by ICU: startIndex() <= getIndex() <=
endIndex().
Without accessing the text, the setToStart() and setToEnd() calls set the index
to the start or to the end of the text. Therefore, these calls are efficient in
starting a forward (post-increment) or backward iteration.
The most general functions for manipulating the index position are the move()
and move32() calls. These calls allow you to move the index forward or backward
relative to its current position, start the index, or move to the end of the
index. The move() and move32() calls do not access the text and are best used
for skipping part of it. The move32() call skips complete code points like
next32PostInc() call and other UChar32-access methods.
### Access to the iteration text
The CharacterIterator class provides the following access methods for the entire
text under iteration:
1. getText() sets a UnicodeString with the text
2. getLength() returns just the length of the text.
This text (and the length) may include more than the actual iteration area
because the start and end indexes may not be the start and end of the entire
text. The text and the iteration range are set in the implementing subclasses.
## Additional Sample Code
C/C++: See
[icu4c/source/samples/citer/](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/citer/)
in the ICU source distribution for code samples.