ICU-20876 Regex Grapheme Cluster matching with Break Iterators.

Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.

The old implementation had fallen behind the Unicode UAX-29 specification for
grapheme clusters and could not be easily updated.

The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.
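
As a rough illustration of that pattern (not the ICU code itself), the sketch
below shows a \X step that asks a boundary iterator for the next cluster
boundary and advances the match position to it. ToyClusterIterator, kDone,
matchOneCluster and countClusters are hypothetical names, and the toy rule
(only U+0300..U+036F extend a cluster) is an assumption standing in for the
full UAX-29 rules that a real break iterator applies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Sentinel returned when there is no further boundary (hypothetical name).
const std::size_t kDone = static_cast<std::size_t>(-1);

// Hypothetical stand-in for a character (grapheme cluster) break iterator.
// A real one would apply the full UAX-29 rules; this toy treats only the
// combining marks U+0300..U+036F as extending the preceding code point.
class ToyClusterIterator {
public:
    explicit ToyClusterIterator(std::u32string text) : text_(std::move(text)) {}

    // First cluster boundary strictly after pos, or kDone at end of text.
    std::size_t following(std::size_t pos) const {
        assert(pos <= text_.size());
        if (pos >= text_.size()) return kDone;
        std::size_t next = pos + 1;
        while (next < text_.size() && isExtend(text_[next])) ++next;
        return next;
    }

private:
    static bool isExtend(char32_t c) { return c >= 0x0300 && c <= 0x036F; }
    std::u32string text_;
};

// One \X step of a hypothetical matcher: consume exactly one cluster starting
// at pos by jumping to the next boundary. Returns false at end of input.
bool matchOneCluster(const ToyClusterIterator &it, std::size_t &pos) {
    std::size_t next = it.following(pos);
    if (next == kDone) return false;
    pos = next;
    return true;
}

// Counts how many times \X would match in sequence over the whole text.
std::size_t countClusters(const std::u32string &text) {
    ToyClusterIterator it(text);
    std::size_t pos = 0, count = 0;
    while (matchOneCluster(it, pos)) ++count;
    return count;
}
```

With this toy rule, "e" followed by U+0301 and then "a" counts as two
clusters. Keeping the boundary rules behind the iterator is the point of the
design: the matcher does not need to change when the rules are updated.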

Also note that this change adds a new dependency on Break Iteration. Regex
patterns that include \X for matching grapheme clusters previously worked
with ICU builds that were configured with no break iteration; they will now
fail on such builds. Patterns that do not use \X are unaffected.
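
The rule can be sketched as follows. This is an illustration only, not how
ICU detects or reports the condition: PatternStatus, checkPattern and
usesGraphemeClusterEscape are hypothetical names, the availability of break
iteration is passed as a plain flag rather than taken from the build
configuration, and the pattern scan is deliberately simplified.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class PatternStatus { Ok, Unsupported };

// True if the pattern contains the escape \X. Simplified scan: the character
// after each backslash is skipped, so an escaped backslash followed by a
// literal X is not reported. \Q...\E quoting is not handled.
bool usesGraphemeClusterEscape(const std::string &pattern) {
    for (std::size_t i = 0; i + 1 < pattern.size(); ++i) {
        if (pattern[i] == '\\') {
            if (pattern[i + 1] == 'X') return true;
            ++i;  // skip the escaped character
        }
    }
    return false;
}

// A pattern is rejected only when it uses \X and break iteration is missing.
PatternStatus checkPattern(const std::string &pattern, bool hasBreakIteration) {
    if (!hasBreakIteration && usesGraphemeClusterEscape(pattern)) {
        return PatternStatus::Unsupported;
    }
    return PatternStatus::Ok;
}
```

Under these assumptions, a pattern such as a\Xb is rejected only when the
flag is false, and a pattern without \X is accepted either way.
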
4 files changed
tree: 8d1f838d924139d1a853c2a70cba9606aca3bf56
  1. .ci-builds/
  2. .github/
  3. docs/
  4. icu4c/
  5. icu4j/
  6. tools/
  7. vendor/
  8. .appveyor.yml
  9. .cpyskip.txt
  10. .gitattributes
  11. .gitignore
  12. .travis.yml
  13. KEYS
  14. README.md
README.md

International Components for Unicode

This is the repository for the International Components for Unicode. The ICU project is under the stewardship of The Unicode Consortium.

Subdirectories and Information

License

Please see ./icu4c/LICENSE (C and J are under an identical license file.)

Copyright © 2016 and later Unicode, Inc. and others. All Rights Reserved. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries. Terms of Use and License