re::engine::RE2 ignores the target string utf8 flag #14
Comments
Likely related to #8 (comment), which has our incomplete fix for non-ASCII matching. In general we always utf8::downgrade strings, because native utf8 Perl matching and capture can be horrendously slow, which is one reason why we switched to RE2 in a few performance-critical areas.
The fact that one person wants an upgrade and another a downgrade is why this is tricky to fix ;-) Ideally what needs to happen is re::e::RE2 needs to compile two
There's also the issue that it isn't as simple as you may think, because you can't even compile some regexps as Latin1 with re::engine::RE2. There's a 12-year-old note in the TODO that's still relevant: https://github.com/dgl/re-engine-RE2/blame/master/TODO
$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /😀\x{1F01}/'
$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /\x{1F01}/'
invalid escape sequence: \x{1F0 at -e line 1.
So maybe the overall approach is to compile as UTF-8, then if needed lazily try Latin1. Maybe we could use the Perl regexp
The workaround we actually use is to compile two patterns, one utf8 and one non-utf8, and then use the appropriate one for the target string. Successfully downgraded target strings get the non-utf8 pattern. That way we get the benefit of faster processing on target strings that don't need utf8 matching. It works for us, but that's because we don't need to use \x escapes above \xFF in our patterns.
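A rough sketch of that two-pattern approach, not the commenter's actual code: the helper names compile_pair and match_pair are made up for illustration, and the BEGIN line only switches to RE2 when the module is installed, so the sketch also runs with the built-in engine.

```perl
use strict;
use warnings;

# Use RE2 for the patterns below when it is installed; otherwise fall back
# to Perl's built-in engine so the sketch still runs.
BEGIN { re::engine::RE2->import if eval { require re::engine::RE2; 1 } }

# Hypothetical helper: compile one pattern source twice, once per flag state.
sub compile_pair {
    my ($source) = @_;
    my ($bytes_source, $utf8_source) = ($source, $source);
    utf8::downgrade($bytes_source);   # dies if the source has characters above 0xFF
    utf8::upgrade($utf8_source);
    my %pair = (
        bytes => qr/$bytes_source/,   # pattern with the utf8 flag off
        utf8  => qr/$utf8_source/,    # pattern with the utf8 flag on
    );
    return \%pair;
}

# Hypothetical helper: pick the variant whose flag matches the target's.
sub match_pair {
    my ($pair, $target) = @_;         # $target is a private copy
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string cannot be represented in Latin-1.
    my $regex = utf8::downgrade($target, 1) ? $pair->{bytes} : $pair->{utf8};
    return $target =~ $regex ? 1 : 0;
}

my $pair = compile_pair('b+');
print "downgradable target: ", match_pair($pair, "abc"), "\n";
print "wide-char target:    ", match_pair($pair, "a\x{2460}bb"), "\n";
```

Targets that downgrade cleanly take the non-utf8 pattern, and everything else takes the utf8 one, so the flags of pattern and target always agree.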
This is a known bug, but I wanted to document it here with a workaround.
re::engine::RE2, at least up through version 0.17, ignores the utf8 flag of the target string during regex matching, and instead operates on the target string's internal SV buffer contents as if it had the same utf8 flag value as the pattern.
This means, for example, that if the target string has the utf8 flag set, but the pattern does not have the utf8 flag, RE2 would treat the target as not having the utf8 flag set, which can prevent proper matching.
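A minimal sketch of this first case, with variable names chosen for illustration. The RE2 half runs in a string eval so the script is simply skipped there when re::engine::RE2 is not installed. According to the report above, the built-in engine matches and RE2 (up through 0.17) does not.

```perl
use strict;
use warnings;

# U+2460 CIRCLED DIGIT ONE; a code point above 0xFF forces the utf8 flag on.
my $target = "\x{2460}";

# Built-in engine: honours the target's utf8 flag.
my $perl_match = $target =~ /\p{No}/ ? 1 : 0;

# RE2 engine, in a string eval so the script still runs without the module.
# The pattern is plain ASCII, so its utf8 flag is off.
my $re2_match = eval q{
    use re::engine::RE2;
    $target =~ /\p{No}/ ? 1 : 0;
};

printf "target utf8 flag: %s\n", utf8::is_utf8($target) ? "on" : "off";
printf "built-in engine:  %s\n", $perl_match ? "match" : "no match";
printf "re::engine::RE2:  %s\n",
      !defined $re2_match ? "skipped (module not available)"
    : $re2_match          ? "match"
    :                       "no match";
```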
In this example, the "Circled Digit One" character, \x{2460}, should match the pattern looking for a character of category "Number, other", \p{No}. It matches using Perl's built-in regex matching, but it does not match with RE2. RE2 is operating on the utf8-encoded bytes of the target string's internal buffer as if they were Latin-1 encoded bytes.
Also, if the target string does not have the utf8 flag set, but the pattern does, RE2 would treat the target as having the utf8 flag set, which can prevent proper matching.
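A minimal sketch of this reverse case, again with illustrative variable names. It assumes that calling utf8::upgrade on the pattern string is an acceptable way to get a pattern with the utf8 flag set; the RE2 half is skipped when the module is not installed.

```perl
use strict;
use warnings;

# U+00F7 DIVISION SIGN, stored as a single Latin-1 byte; utf8 flag is off.
my $target = "\xF7";

# ASCII-only pattern source; upgrading turns its utf8 flag on without
# changing the characters it contains.
my $pattern = '\p{Sm}';
utf8::upgrade($pattern);

# Built-in engine: \p{...} uses Unicode rules regardless of the flags.
my $perl_match = $target =~ /$pattern/ ? 1 : 0;

# RE2 engine, in a string eval so the script still runs without the module.
my $re2_match = eval q{
    use re::engine::RE2;
    $target =~ /$pattern/ ? 1 : 0;
};

printf "target utf8 flag:  %s\n", utf8::is_utf8($target)  ? "on" : "off";
printf "pattern utf8 flag: %s\n", utf8::is_utf8($pattern) ? "on" : "off";
printf "built-in engine:   %s\n", $perl_match ? "match" : "no match";
printf "re::engine::RE2:   %s\n",
      !defined $re2_match ? "skipped (module not available)"
    : $re2_match          ? "match"
    :                       "no match";
```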
The "Division sign" character, \x{F7}, should match the pattern looking for a character of category "Symbol, math", \p{Sm}. It matches using Perl's built-in regex matching, but it does not match with RE2. RE2 is operating on the Latin-1 encoded byte of the target string's internal buffer as if it were a utf8-encoded buffer.
As a further consequence of this bug, the utf8 flag is set incorrectly on capture buffers, using the flag value of the pattern instead of the flag of the target, which can corrupt the captured text.
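A minimal sketch of the capture problem. The pattern (.+) and the code_points helper are illustrative choices, not taken from the report. Strings are printed as code points so the output stays ASCII-only, and the RE2 half is skipped when the module is not installed.

```perl
use strict;
use warnings;

# U+2460 CIRCLED DIGIT ONE; the utf8 flag is on for this string.
my $target = "\x{2460}";

# Hypothetical helper: render a string as a list of its code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# Built-in engine: the capture is a faithful copy of the target.
my ($perl_capture) = $target =~ /(.+)/;

# RE2 engine, in a string eval so the script still runs without the module.
# Yields an array ref of captures, or undef if the module cannot be loaded.
my $re2_captures = eval q{
    use re::engine::RE2;
    [ $target =~ /(.+)/ ];
};

printf "target:           %s\n", code_points($target);
printf "built-in capture: %s (%s)\n", code_points($perl_capture),
    $perl_capture eq $target ? "equal to target" : "differs from target";

if (!defined $re2_captures) {
    print "RE2 capture:      skipped (module not available)\n";
}
elsif (!@$re2_captures) {
    print "RE2 capture:      no match\n";
}
else {
    my $capture = $re2_captures->[0];
    printf "RE2 capture:      %s (%s)\n", code_points($capture),
        $capture eq $target ? "equal to target" : "differs from target";
}
```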
In this example the captured string should be equal to the target string, but using RE2, they are not equal because it failed to correctly set the utf8 flag on the capture buffer. The utf8-encoded bytes of the target string's internal buffer were copied to the capture buffer as if they were Latin-1 encoded, corrupting the text.
The Workaround
In short, RE2 only works correctly when the utf8 flag value of the target string matches the utf8 flag value of the pattern string. Perl programs are not supposed to need to know utf8 flag values or manipulate them. But in this case, doing so is necessary to work around the RE2 bug. The simplest way to accomplish this is to utf8::upgrade the target string just before matching and utf8::upgrade the pattern string just before matching, or just before compiling in the case where the pattern is compiled. For example:
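A minimal sketch of the workaround, with illustrative variable names. Both strings are upgraded in place before the pattern is compiled and matched. The RE2 half is skipped when the module is not installed; per the description above, it should match once both flags agree.

```perl
use strict;
use warnings;

my $target  = "\xF7";      # U+00F7 DIVISION SIGN, utf8 flag off
my $pattern = '\p{Sm}';    # ASCII-only pattern source, utf8 flag off

# Force both flags on. The strings still compare equal to their originals;
# only the internal representation changes.
utf8::upgrade($target);
utf8::upgrade($pattern);

# Built-in engine, for comparison.
my $perl_match = $target =~ /$pattern/ ? 1 : 0;

# RE2 engine, in a string eval so the script still runs without the module.
my $re2_match = eval q{
    use re::engine::RE2;
    my $re = qr/$pattern/;   # compiled from the upgraded pattern string
    $target =~ $re ? 1 : 0;
};

printf "target utf8 flag:  %s\n", utf8::is_utf8($target)  ? "on" : "off";
printf "pattern utf8 flag: %s\n", utf8::is_utf8($pattern) ? "on" : "off";
printf "built-in engine:   %s\n", $perl_match ? "match" : "no match";
printf "re::engine::RE2:   %s\n",
      !defined $re2_match ? "skipped (module not available)"
    : $re2_match          ? "match"
    :                       "no match";
```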
Using utf8::upgrade on a string is generally a safe operation. It should have no consequences outside of this RE2 bug, and it will be compatible with any future RE2 version that fixes this bug.