ICU-20939 Fix problem with regexp \b boundaries and UTF-8 text

In regular expressions, word boundaries tested with \b were incorrect when
two conditions held together: the pattern was in Unicode mode, meaning that an
ICU word break iterator is used to find the boundaries, and the text being
matched was UTF-8 encoded.

The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
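
The index mismatch described above can be illustrated with a minimal sketch.
It does not use ICU: word_boundaries_utf8 is a hypothetical toy stand-in for a
word break iterator over UTF-8 text, and is_boundary_buggy / is_boundary_fixed
are illustrative names, not functions from the ICU sources. The only
assumption taken from the description is that boundaries are reported as
UTF-8 (byte) indexes, so converting the query position to a UTF-16 index
before the lookup compares values from two different index spaces.

```python
def word_boundaries_utf8(text):
    """Toy word breaker (not ICU): returns boundaries as UTF-8 byte offsets."""
    boundaries = set()
    offset = 0
    prev_is_word = False
    for ch in text:
        is_word = ch.isalnum()
        if is_word != prev_is_word:
            boundaries.add(offset)
        prev_is_word = is_word
        offset += len(ch.encode("utf-8"))  # advance by the encoded byte length
    if prev_is_word:
        boundaries.add(offset)
    return boundaries


def utf16_index_from_utf8(data, byte_index):
    """UTF-16 code unit index matching a UTF-8 byte offset.

    byte_index must fall on a character boundary.
    """
    prefix = data[:byte_index].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


TEXT = "déjà vu"                          # 'é' and 'à' take 2 bytes each in UTF-8
DATA = TEXT.encode("utf-8")               # 9 bytes, 7 UTF-16 code units
BOUNDARIES = word_boundaries_utf8(TEXT)   # {0, 6, 7, 9}, all byte offsets


def is_boundary_buggy(byte_index):
    # Mimics the bug: the position is converted to a UTF-16 index, but the
    # boundaries it is compared against are still UTF-8 byte offsets.
    return utf16_index_from_utf8(DATA, byte_index) in BOUNDARIES


def is_boundary_fixed(byte_index):
    # Mimics the fix: the original UTF-8 index is used everywhere.
    return byte_index in BOUNDARIES
```

In this sketch, byte offset 6 (the end of "déjà") is a real boundary, yet the
buggy lookup converts it to 4 and misses it. Byte offset 8 (between "v" and
"u") is not a boundary, yet the buggy lookup converts it to 6 and reports one.
For pure ASCII text both index spaces coincide, so the two lookups agree.
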
2 files changed
tree: 95495986e3726905d07dfe31823efccd08344311
  1. .ci-builds/
  2. .github/
  3. docs/
  4. icu4c/
  5. icu4j/
  6. tools/
  7. vendor/
  8. .appveyor.yml
  9. .cpyskip.txt
  10. .gitattributes
  11. .gitignore
  12. .travis.yml
  13. KEYS
  14. README.md
README.md

International Components for Unicode

This is the repository for the International Components for Unicode. The ICU project is under the stewardship of The Unicode Consortium.

Build Status (master branch)

Build                               Status
Travis CI                           Build Status
Azure Pipelines                     Build Status
Azure Pipelines (Exhaustive Tests)  Build Status
AppVeyor                            Build status
Fuzzing                             Fuzzing Status

Subdirectories and Information

License

Please see ./icu4c/LICENSE (C and J are under an identical license file.)

Copyright © 2016 and later Unicode, Inc. and others. All Rights Reserved. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries. Terms of Use and License