Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re::engine::RE2 ignores the target string utf8 flag #14

Open
jbalazerpfpt opened this issue Jun 7, 2023 · 3 comments
Open

re::engine::RE2 ignores the target string utf8 flag #14

jbalazerpfpt opened this issue Jun 7, 2023 · 3 comments

Comments

@jbalazerpfpt
Copy link

jbalazerpfpt commented Jun 7, 2023

This is a known bug, but I wanted to document it here with a workaround.

re::engine::RE2, at least up through version 0.17, ignores the utf8 flag of the target string during regex matching, and instead operates on the target string's internal SV buffer contents as if it had the same utf8 flag value as the pattern.

This means, for example, that if the target string has the utf8 flag set, but the pattern does not have the utf8 flag, RE2 would treat the target as not having the utf8 flag set, which can prevent proper matching.

use re::engine::RE2;
print "match\n" if ("\x{2460}" =~ /\p{No}/);

In this example, the "Circled Digit One" character, \x{2460}, should match the pattern looking for a character of category "Number, other", \p{No}. It matches using Perl's built-in regex matching, but it does not match with RE2. RE2 is operating on the utf8-encoded bytes of the target string's internal buffer as if they were Latin-1 encoded bytes.

Also, if the target string does not have the utf8 flag set, but the pattern does, RE2 would treat the target as having the utf8 flag set, which can prevent proper matching.

use re::engine::RE2;
my $pattern = "\\p{Sm}\x{2461}?";
print "match\n" if ("\x{F7}" =~ /$pattern/);

The "Division sign" character \x{F7} should match the pattern looking for a character of category "Symbol, math", \p{Sm}. It matches using Perl's built-in regex matching, but it does not match with RE2. RE2 is operating on the Latin-1 encoded byte of the target string's internal buffer as if it were a utf8-encoded buffer.

As a further consequence of this bug, the utf8 flag is set incorrectly on capture buffers, using the flag value of the pattern instead of the flag of the target, which can corrupt the captured text.

use re::engine::RE2;
if ("\x{2460}" =~ /(.*)/) {
    print "equal\n" if ("\x{2460}" eq $1);
}

In this example the captured string should be equal to the target string, but using RE2, they are not equal because it failed to correctly set the utf8 flag on the capture buffer. The utf8-encoded bytes of the target string's internal buffer were copied to the capture buffer as if they were Latin-1 encoded, corrupting the text.

The Workaround

In short, RE2 only works correctly when the utf8 flag value of the target string matches the utf8 flag value of the pattern string. Perl programs are not supposed to need to know utf8 flag values or manipulate them. But in this case, doing so is necessary to work around the RE2 bug. The simplest way to accomplish this is to utf8::upgrade the target string just before matching and utf8::upgrade the pattern string just before matching, or just before compiling in the case where the pattern is compiled. For example:

use re::engine::RE2;
my $target = "\x{2460}";
my $pattern = "(\\p{No})";
utf8::upgrade($target);
utf8::upgrade($pattern);
if ($target =~ /$pattern/) {
    print "match\n";
    print "equal\n" if ($target eq $1);
}

Using utf8::upgrade on a string is generally a safe operation. It should have no consequences outside of this RE2 bug, and it will be compatible with any future RE2 version that fixes this bug.

@todd-richmond
Copy link

likely related to #8 (comment) which has our incomplete fix for non-ascii matching. In general we always utf8::downgrade strings because native utf8 perl matching+capture can be horrendously slow - which is one reason why we switched to RE2 in a few performance critical areas

@dgl
Copy link
Owner

dgl commented Jun 7, 2023

The fact one person wants an upgrade and one a downgrade is why this is tricky to fix ;-)

Ideally what needs to happen is re::e::RE2 needs to compile two RE2s, one as UTF-8 and one as Latin1, then pick (maybe one lazily, it needs to do at least one to handle errors). The first problem is however that's done has the potential to seriously affect performance for many cases.

There's also the issue it isn't as simple as you may think, because you can't even compile some regexps as Latin1 with re::engine::RE2, there's a 12 year old note in the todo that's still relevant: https://github.com/dgl/re-engine-RE2/blame/master/TODO

$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /😀\x{1F01}/'
$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /\x{1F01}/'
invalid escape sequence: \x{1F0 at -e line 1.

So maybe the overall approach is compile as UTF-8, then if needed lazily try Latin1. Maybe we could use the Perl regexp /a flag to actually mean Latin1, although that needs some more design thought; certainly how to do this while making it more perl compatible without regressing people's potentially performance critical code.

@jbalazerpfpt
Copy link
Author

jbalazerpfpt commented Jun 7, 2023

The workaround we actually use is to compile two patterns, one utf8 and one non-utf8, and then use the appropriate one for the target string. Successfully downgraded target strings get the non-utf8 pattern. That way we get the benefit of faster processing on target strings that don't need utf8 matching. It works for us, but that's because we don't need to use \x escapes above \xFF in our patterns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants