commit | d6b88d49e3be7096baf3828776c2b482a8ed1780 | |
---|---|---|
author | Andy Heninger <andy.heninger@gmail.com> | Sat Feb 01 20:20:37 2020 -0800 |
committer | Steven R. Loomis <srl295@gmail.com> | Mon Feb 03 16:51:17 2020 -0800 |
tree | 95495986e3726905d07dfe31823efccd08344311 | |
parent | b7d08bc04a4296982fcef8b6b8a354a9e4e7afca | |
ICU-20939 Fix problem w regexp \b boundaries & UTF-8 text

In regular expressions, when testing for word boundaries with \b, the boundaries were incorrect when in Unicode mode, meaning that an ICU word break iterator is being used to find the boundaries, and the text being matched is UTF-8 encoded. The bug stemmed from a misunderstanding of how string indexes work with UText and break iterators, leading to the inclusion of code to convert from UTF-8 to UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere. Removing the indexing conversion fixes the problem.
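The index mismatch described in the commit message can be illustrated without ICU. The following is a minimal Python sketch, not the ICU code: the helper `utf8_to_utf16_index` and the sample text are hypothetical, and the sketch only assumes, as the message states, that boundary positions over UTF-8 text are expressed as UTF-8 (byte) indexes. It shows that once the text contains a non-ASCII character, converting a UTF-8 index to a UTF-16 index yields a different number, and reusing that number as a position in the UTF-8 text points somewhere else.

```python
def utf8_to_utf16_index(text: str, utf8_index: int) -> int:
    """Map a UTF-8 byte offset to the UTF-16 code-unit offset of the same position.

    Hypothetical helper for illustration only; it stands in for the kind of
    index conversion that the commit removed.
    """
    prefix = text.encode("utf-8")[:utf8_index].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


# "naive cafe" with U+00EF and U+00E9, each 2 bytes in UTF-8 and 1 unit in UTF-16.
text = "na\u00efve caf\u00e9"
utf8_bytes = text.encode("utf-8")

# Position of the word boundary right after the first word, as a UTF-8 byte index.
native_index = len("na\u00efve".encode("utf-8"))

# The same position expressed in UTF-16 code units is a smaller number.
converted_index = utf8_to_utf16_index(text, native_index)

# Reusing the converted number as a UTF-8 byte index lands inside the word,
# just before its final letter, so a boundary test there gives the wrong answer.
byte_at_converted = utf8_bytes[converted_index:converted_index + 1]
byte_at_native = utf8_bytes[native_index:native_index + 1]

print(native_index, converted_index, byte_at_converted, byte_at_native)
```

For pure ASCII text the two indexes coincide, which is one reason a bug of this kind can go unnoticed until non-ASCII input is matched.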
This is the repository for the International Components for Unicode. The ICU project is under the stewardship of The Unicode Consortium.
Build | Status |
---|---|
TravisCI | |
Azure Pipelines | |
Azure Pipelines (Exhaustive Tests) | |
AppVeyor | |
Fuzzing | |
- icu4c/: ICU for C/C++
- icu4j/: ICU for Java
- tools/: Tools
- vendor/: Vendor dependencies

Please see ./icu4c/LICENSE (C and J are under an identical license file.)
Copyright © 2016 and later Unicode, Inc. and others. All Rights Reserved. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries. Terms of Use and License