RE2 compatibility for 920120 #1663
base: v3.3/dev
Conversation
I haven't looked at this in detail yet, but it occurs to me we could move it to util/regexp-assemble/regexp-920120.data to help readability, at least for the non-assembled version (assuming the generated version works the same way as the one in this PR).
Having had another look, I realise the pattern presented is generated by a script. My bad, sorry. Perhaps we should include your script in this repository, in a similar way to the existing assembly tooling.

As for the pattern, I believe you could extract the common suffix as well.

Finally, as with previous cases replacing lookarounds, the resulting match will be different.
Awesome PR and very interesting blogpost. Thank you very much @allanrbo. I second @fgsch's remark about including the generator in our repository. I have also looked up 920120 in my logs to see if we can come up with additional FTW tests - the 3 we have are a bit thin - but no luck. However, I suggest adding tests to the PR based on Allan's work and documentation. This will be useful as we move forward.
Thanks for reading. Good idea to include the generator script in this repo, so it doesn't get lost. Will update the PR. Will also see if I can make some more FTW tests from those manual tests I mentioned above.
Thanks for updating the PR. Have you seen this part of my comment?
Sorry, overlooked those questions. Good questions.

Regarding extracting the common suffix
I don't think it's possible to extract out the common suffix in this case.

Regarding perf
With regards to performance, yes, it's worse. So it's a tradeoff. Especially for WAFs running ModSecurity, which is currently probably close to 100% of users. The reason I still propose it is that it paves the way to make CRS useful on non-ModSecurity WAFs too. Here are some numbers I've generated from a test program which evaluates the old and new regexes 1 million times against a sample input.

Regarding whether match is different
I've taken great care to ensure it's not different. That's what I've written about here: http://allanrbo.blogspot.com/2020/01/alternative-to-negative-lookbehinds-in.html . But sure, there could be a flaw in my reasoning that I haven't realized yet :-)
Right. I understand the nuances of ^. I was a bit hasty here and didn't analyse the pattern in detail before suggesting the move. You are entirely correct.
Based on your tests we are talking about a ~9x performance penalty for 99.99% of users. FWIW, for Hyperscan, this is somewhat less of a problem since you can use Chimera. For RE2 unfortunately there is no alternative.
What I meant here is that while both patterns will match the same inputs, what the patterns match on will differ. Taking the input above and comparing the results before and after this change shows that the matched substrings are different.
For a simpler test limited to PCRE, you can run pcretest.
Hi Federico, thanks for the input. Yeah, I totally get it if CRS doesn't want to merge this PR. It's fine if we just close it. I know it doesn't have any short-term benefit for anyone besides me. The reason I thought I'd share it anyway is that it may have a long-term benefit, as it paves the way for CRS on future WAF engines that strictly avoid PCRE (like the experiment I'm currently prototyping). I wish there were a way to achieve this without such a long, complex, and slow regex. I feel like there may be a way with chained rules, but I just haven't been able to find one...

Perf discussion continued
Thinking of it as a 900% perf penalty feels kind of blunt. It's probably closer to 0.25%, and only on very specific requests. Here's why I believe so. No requests will be 9x slower, but on those specific multipart/form-data POST requests that also contain a filename field, the one operation that scans the name and filename fields, which used to take ModSecurity around 0.001633 ms, is now 9x slower at around 0.01467 ms. On my computer with Nginx+ModSecurity I ran a couple of thousand runs of a simple 520 byte multipart/form-data POST non-keepalive request with a file field. It averaged around 5.189 ms. So the theoretical overhead of this should be (0.01467-0.001633)/5.189 = 0.0025 = 0.25% on these requests. I reran with the new regex, and it averaged 5.171 ms - surprisingly lower, but close enough to be measurement error.

Regarding backtracking
RE2 will never backtrack. It achieves |-alternatives without backtracking, purely via state machines: https://github.com/google/re2/wiki/WhyRE2

pcretest
I couldn't find pcretest, but found pcre2test in Ubuntu. Assuming it's equivalent.

Whether match is different
I understand what you mean now. Yes, TX:0 will be slightly different with the new regex, but since TX:0 is not used, I don't think this is a problem?
It's too early to tell either way (whether this will be merged) but the more information we have, the more informed the decision will be.
This will be the worst-case scenario based on your tests, but still something we should consider.
I was referring to PCRE here, the only supported engine in ModSecurity at this time.
TX:0 and MATCHED_VAR are basically the same, so it will affect logdata, but as I mentioned previously I don't think it's a problem, just something to keep in mind.
Whether match is different
I believe MATCHED_VAR will be the same before and after, because MATCHED_VAR contains the entire field that was matched, and not just the substring that the regex matched. I just tested a rule against a URL to confirm what the logdata contains.
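A hypothetical rule of this shape (rule id, pattern, and values invented for illustration, not the original test) shows the behavior:

```
SecRule ARGS "@rx attack[0-9]+" \
    "id:1000001,phase:2,deny,log,\
    logdata:'Matched Data: %{MATCHED_VAR} found within %{MATCHED_VAR_NAME}'"
```

For a request such as `/?x=fooattack123bar`, the logdata would report the entire argument value (fooattack123bar) via MATCHED_VAR, not just the substring the regex matched (attack123).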
That's the correct and expected behavior for MATCHED_VAR, I think.
I stand corrected. As you said, there will be no difference in logging with this change.
Meeting decision: @dune73 will look into this, namely the viability of the performance impact, and then we can see whether we accept the PR (or drop it).
Adding the needs-action label to express the fact that we need hard performance data from production. |
Same as SpiderLabs/owasp-modsecurity-crs#1663. I doubt they will accept the PR in their repo, as the downsides seem to outweigh the benefits for them. PR URL: https://msazure.visualstudio.com/DefaultCollection/One/_git/Networking-Azwaf/pullrequest/2495892 Related work items: #5880651
The regex backreference support works by parsing regexes at initialization to detect the presence of backreferences such as `\1`, `\2`, etc. These are then replaced with the pattern of the group that they reference. So for example `(abc+) \1` becomes `(abc+) (abc+)`. The `tryMatch` function then does the actual comparison to ensure that the two groups actually had the same content. Only groups that are surrounded by `\b` word boundaries are supported, such as `\b(abc+)\b`. This is because otherwise it becomes really complicated with situations like `11=111111`. The CRS also contains a construct that uses a negative lookahead to assert that a backreference is _not equal_ to a group (simplified): `\b(\w+)\b != (?!\b(\1)\b)(\w+)`. Support for this special construct is therefore also implemented. The change to CRS rule 942130 is currently pending checkin: RE2 compatibility for 920120 SpiderLabs/owasp-modsecurity-crs#1663 Because this gets us to 100% of the CRS tests passing, I have removed code that was preventing running the CRS tests by default in the CI pipeline. So the CRS tests now always run by default. PR URL: https://msazure.visualstudio.com/DefaultCollection/One/_git/Networking-Azwaf/pullrequest/2668744 Related work items: #5697858
Problem
Negative lookbehinds are not supported by many regex engines (RE2, Go's regexp, Hyperscan).
My motivation for avoiding this PCRE-specific construct is that I am working on an experimental WAF that uses CRS, but uses Hyperscan and Go's regexp package rather than PCRE. I'm hoping to be able to keep using vanilla CRS, which is why I'm hoping to get this accepted upstream. Others who are doing similar experiments with non-PCRE regex engines will benefit too.
Suggested rewrite
The goal of the original regex was to disallow a semicolon, except when it is preceded by one of the strings that make it a well-known HTML entity, such as `&auml;`, `&acirc;`, etc.

The approach of my solution is to write alternatives for each possible prefix that is not `&[aAoOuUyY]uml;`, `&[aAeEiIoOuU]circ;`, etc.

I have described the approach in detail, and posted the Python script I used to generate the final regex, here: http://allanrbo.blogspot.com/2020/01/alternative-to-negative-lookbehinds-in.html
This is a tough tradeoff in human readability of the rule. If anyone has any ideas for a more readable approach, I'd be very eager to hear them.
One alternative approach I've explored but failed at is to split the rule into multiple chained rules. The first rule in the chain could catch something like `(&[^;]+)?;`. The second rule in the chain could then be a negative check, something like `SecRule TX:1 "!@rx (&[aAoOuUyY]uml;|&[aAeEiIoOuU]circ|...)" ...`. Unfortunately this fails when the input has multiple semicolons. The first semicolon may be preceded by `&auml`, thereby only evaluating the second chained rule for this TX:1, ignoring subsequent semicolons which may contain the attack.

Testing
Ran all the FTW regression tests.
Manually tested via regex101.com.
Additionally wrote a little Go program to test against a few different regex engines: https://gist.github.com/allanrbo/93fb64549151169ff3a8a107cce8ce1b