RE2 compatibility for 920120 #1663
base: v3.3/dev
Conversation
I haven't looked at this in detail yet, but it occurs to me we could move it to util/regexp-assemble/regexp-920120.data to help readability, at least for the non-assembled version (assuming the generated version works the same way as the one in this PR).
Having had another look, I realise the pattern presented is generated by a script. My bad, sorry. Perhaps we should include your script in this repository, in a similar way to the existing assembly tooling.

As for the pattern, I believe you could extract the common suffix as well.

Finally, as with previous cases replacing lookarounds, the resulting match will be different.
Awesome PR and very interesting blogpost. Thank you very much @allanrbo. I second @fgsch's remark about including the generator in our repository. I have also looked up 920120 in my logs to see if we can come up with additional FTW tests - the 3 we have are a bit thin - but no luck. However, I suggest adding tests to the PR based on Allan's work and documentation. This will be useful as we move forward.
Thanks for reading. Good idea to include the generator script in this repo, so it doesn't get lost. Will update the PR. Will also see if I can make some more FTW tests from those manual tests I mentioned above.
Thanks for updating the PR. Have you seen this part of my comment?
Sorry, overlooked those questions. Good questions.

Regarding extracting the common suffix
I don't think it's possible to extract out the common suffix in this case.

Regarding perf
With regards to performance, yes, it's worse. So it's a tradeoff. Especially for WAFs running ModSecurity, which is currently probably close to 100% of users. The reason I still propose it is that it paves the way to make CRS useful on non-ModSecurity WAFs too. Here are some numbers I've generated from a test program which evaluates the old and new regexes 1 million times against a sample input.

Regarding whether match is different
I've taken great care to ensure it's not different. That's what I've written about here: http://allanrbo.blogspot.com/2020/01/alternative-to-negative-lookbehinds-in.html . But sure, there could be a flaw in my reasoning that I haven't realized yet :-)
Right. I understand the nuances of ^. I was a bit hasty here and didn't analyse the pattern in detail before suggesting the move. You are entirely correct.
Based on your tests we are talking about a ~9x performance penalty for 99.99% of users. FWIW, for Hyperscan, this is somewhat less of a problem since you can use Chimera. For RE2 unfortunately there is no alternative.
What I meant here is that while both patterns will match the same inputs, what the patterns match on will differ. Taking the input above and comparing the results before and after this change shows that the matched substrings are different.
For a simpler test limited to PCRE, you can run pcretest.
Hi Federico, thanks for the input. Yeah, I totally get it if CRS doesn't want to merge this PR. It's fine if we just close it. I know it doesn't have any short-term benefit for anyone besides me. The reason I thought I'd share it anyway is that it may have a long-term benefit, as it paves the way for CRS on future WAF engines that strictly avoid PCRE (like the experiment I'm currently prototyping). I wish there were a way to achieve this without such a long, complex, and slow regex. I feel like there may be a way with chained rules, but I just haven't been able to find one...

Perf discussion continued
Thinking of it as a 900% perf penalty feels kind of blunt. It's probably closer to 0.25%, and only on very specific requests. Here's why I believe so. No requests will be 9x slower, but on those specific multipart/form-data POST requests that also contain a filename field, the one operation that scans the name and filename fields, which used to take ModSecurity around 0.001633 ms, is now 9x slower at around 0.01467 ms. On my computer with Nginx+ModSecurity I ran a couple of thousand runs of a simple 520 byte multipart/form-data POST non-keepalive request with a file field. It averaged around 5.189 ms. So the theoretical overhead of this should be (0.01467-0.001633)/5.189 = 0.0025 = 0.25% on these requests. I reran with the new regex, and it averaged 5.171 ms - surprisingly lower, but close enough to be measurement error.

Regarding backtracking
RE2 will never backtrack. It achieves |-alternatives without backtracking, purely via state machines: https://github.com/google/re2/wiki/WhyRE2

pcretest
I couldn't find pcretest, but found pcre2test in Ubuntu. Assuming it's equivalent.

Whether match is different
I understand what you mean now. Yes, TX:0 will be slightly different with the new regex, but since TX:0 is not used, I don't think this is a problem?
It's too early to tell either way (whether this will be merged) but the more information we have, the more informed the decision will be.
This will be the worst-case scenario based on your tests, but still something we should consider.
I was referring to PCRE here, the only supported engine in ModSecurity at this time.
TX:0 and MATCHED_VAR are basically the same, so it will affect logdata, but as I mentioned previously I don't think it's a problem, just something to keep in mind.
Whether match is different
I believe MATCHED_VAR will be the same before and after, because MATCHED_VAR contains the entire field that was matched, and not just the substring that the regex matched. I just tested a rule against a URL to confirm what the logdata contains.
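A hypothetical rule of this shape (rule id, pattern, and values invented for illustration, not the original test) shows the behavior:

```
SecRule ARGS "@rx attack[0-9]+" \
    "id:1000001,phase:2,deny,log,\
    logdata:'Matched Data: %{MATCHED_VAR} found within %{MATCHED_VAR_NAME}'"
```

For a request such as `/?x=fooattack123bar`, the logdata would report the entire argument value (fooattack123bar) via MATCHED_VAR, not just the substring the regex matched (attack123).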
That's the correct and expected behavior for MATCHED_VAR, I think.
I stand corrected. As you said, there will be no difference in logging with this change.
Meeting decision: @dune73 will look into this, namely the viability of the performance impact, and then we can see whether we accept the PR (or drop it).
Adding the needs-action label to express the fact that we need hard performance data from production. |
Same as SpiderLabs/owasp-modsecurity-crs#1663. I doubt they will accept the PR in their repo, as the downsides seem to outweigh the benefits for them. PR URL: https://msazure.visualstudio.com/DefaultCollection/One/_git/Networking-Azwaf/pullrequest/2495892 Related work items: #5880651
The regex backreference support works by parsing regexes at initialization to detect the presence of backreferences such as `\1`, `\2`, etc. These are then replaced with the pattern of the group that they reference. So for example `(abc+) \1` becomes `(abc+) (abc+)`. The `tryMatch` function then does the actual comparison to ensure that the two groups actually had the same content. Only groups that are surrounded by `\b` word boundaries are supported, such as `\b(abc+)\b`. This is because otherwise it becomes really complicated with situations like `11=111111`. The CRS also contains a construct that uses a negative lookahead to assert that a backreference is _not equal_ to a group (simplified): `\b(\w+)\b != (?!\b(\1)\b)(\w+)`. Support for this special construct is therefore also implemented. The change to CRS rule 942130 is currently pending checkin: RE2 compatibility for 920120 SpiderLabs/owasp-modsecurity-crs#1663 Because this gets us to 100% of the CRS tests passing, I have removed code that was preventing running the CRS tests by default in the CI pipeline. So the CRS tests now always run by default. PR URL: https://msazure.visualstudio.com/DefaultCollection/One/_git/Networking-Azwaf/pullrequest/2668744 Related work items: #5697858
Problem
Negative lookbehinds are not supported by many regex engines (RE2, Go's regexp, Hyperscan).
My motivation for avoiding this PCRE-specific construct is that I am working on an experimental WAF that uses CRS, but uses Hyperscan and Go's regexp package rather than PCRE. I'm hoping to be able to keep using vanilla CRS, which is why I'm hoping to get this accepted upstream. Others who are doing similar experiments with non-PCRE regex engines will benefit too.
Suggested rewrite
The goal of the original regex was to disallow a semicolon, except when it is preceded by one of the strings that make it a well-known HTML entity, such as `&auml;`, `&acirc;`, etc.

The approach of my solution is to write alternatives for each possible prefix that is not `&[aAoOuUyY]uml;`, `&[aAeEiIoOuU]circ;`, etc.

I have described the approach in detail, and posted the Python script I used to generate the final regex, here: http://allanrbo.blogspot.com/2020/01/alternative-to-negative-lookbehinds-in.html
This is a tough tradeoff in human readability of the rule. If anyone has any ideas for a more readable approach, I'd be very eager to hear them.
One alternative approach I've explored but failed at is to split the rule into multiple chained rules. The first rule in the chain could catch something like `(&[^;]+)?;`. The second rule in the chain could then be a negative check, something like `SecRule TX:1 "!@rx (&[aAoOuUyY]uml;|&[aAeEiIoOuU]circ|...)" ...`. Unfortunately this fails when the input has multiple semicolons. The first semicolon may be preceded by `&auml`, thereby only evaluating the second chained rule for this TX:1, ignoring subsequent semicolons which may contain the attack.

Testing
Ran all the FTW regression tests.
Manually tested via regex101.com.
Additionally wrote a little Go program to test against a few different regex engines: https://gist.github.com/allanrbo/93fb64549151169ff3a8a107cce8ce1b