Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: adds memoize implementation for regexes and ahocorasick #836

Merged
merged 10 commits into from
Aug 6, 2023

Conversation

jcchavezs
Copy link
Member

@jcchavezs jcchavezs commented Jul 5, 2023

Currently we create and allocate memory for every regex we compile, however there are cases where you compile the same regex over and over e.g. corazawaf/coraza-caddy#76. Here we implement the memoize pattern to be able to reuse the regex and reduce the memory consumption.

TODO:

  • adds docs to code

Ping @llinder @anuraaga @jptosso

Currently we create and allocate memory for every regex we compile, however there are cases where you compile the same regex over and over e.g. corazawaf/coraza-caddy#76. Here we implement the memoize pattern to be able to reuse the regex and reduce the memory consumption.
@jcchavezs jcchavezs requested a review from a team as a code owner July 5, 2023 20:16
@codecov
Copy link

codecov bot commented Jul 5, 2023

Codecov Report

Patch coverage: 71.79% and project coverage change: -4.24% ⚠️

Comparison is base (cecd79c) 81.59% compared to head (8d52562) 77.36%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #836      +/-   ##
==========================================
- Coverage   81.59%   77.36%   -4.24%     
==========================================
  Files         159      157       -2     
  Lines        9013     8738     -275     
==========================================
- Hits         7354     6760     -594     
- Misses       1412     1749     +337     
+ Partials      247      229      -18     
Flag Coverage Δ
default ?
examples ?
ftw ?
ftw-multiphase 48.93% <56.41%> (-0.05%) ⬇️
tinygo 74.77% <66.66%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
internal/operators/validate_nid.go 46.59% <0.00%> (ø)
internal/corazawaf/rule.go 93.43% <45.45%> (-1.49%) ⬇️
internal/seclang/directives.go 71.98% <50.00%> (-5.39%) ⬇️
internal/memoize/noop.go 100.00% <100.00%> (ø)
internal/operators/from_file.go 86.95% <100.00%> (ø)
internal/operators/pm.go 91.89% <100.00%> (ø)
internal/operators/pm_from_dataset.go 100.00% <100.00%> (ø)
internal/operators/pm_from_file.go 89.28% <100.00%> (+0.39%) ⬆️
internal/operators/restpath.go 75.00% <100.00%> (ø)
internal/operators/rx.go 84.61% <100.00%> (-0.39%) ⬇️

... and 23 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jptosso
Copy link
Member

jptosso commented Jul 5, 2023

We should include a flush or reset function so we can delete everything after a web server reload. That function must be exported. I would include it in experimental/coraza

@jcchavezs
Copy link
Member Author

That function must be exported. I would include it in experimental/coraza

There is a Close function https://github.com/corazawaf/coraza/pull/836/files#diff-48bb48b365a730502b9fc1bf2654dfbb3f84c124a02bad1a68e8b3ae708a8707R52, I guess we need to figure out how to expose it and most likely we will expose it thorugh the WAF.Close.

@jptosso
Copy link
Member

jptosso commented Jul 5, 2023

Waf close is not coming in v3 as we cannot break the api, we should just export all helpers in experimental until we can merge them. But closing cache is different, as this is global and no per-WAF

@jcchavezs
Copy link
Member Author

Waf close is not coming in v3 as we cannot break the api, we should just export all helpers in experimental until we can merge them. But closing cache is different, as this is global and no per-WAF

Not a breaking change if we don't force people to use it. ATM there is no close method and people still live with it. Although adding Close "breaks" the interface people is not supposed to implement their own WAF right?

@jcchavezs
Copy link
Member Author

If adding the method to the interface is too concerning we could:

  1. Add close method to the WAF implementation (internal package)
  2. Create a CloseWAF(waf) function in the experimental package and let it cast the internal type and call the Close method.

@jcchavezs
Copy link
Member Author

I believe we can't do this for proxy-wasm but I do believe we might have the same issue with different wafs for authorities. Any thought @anuraaga ?

@anuraaga
Copy link
Contributor

anuraaga commented Jul 5, 2023

Is the question whether we can add a Close method? As I mentioned before, we could have documented it better already but I don't think it's too late to document the waf interface is not for external implementation without worrying about major version, it's very early still.

For this PR though, I don't see how a WAF close method would help though since IIUC the cache is global, not per waf. Having a global method in experimental for now seems ok.


//go:build !tinygo

// Highly inspired in https://github.com/patrickmn/go-cache/blob/master/cache.go
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ok but this file looks like an obvious mutex guarded map, not sure it's worth referencing anything.

Anyways it looks like a sync.Map should be much faster for this, it's docs specifically mention the write once read many case of a cache

@jcchavezs
Copy link
Member Author

PTAL @anuraaga @jptosso I did some simplifications and removed the Close method.

I believe in coraza-caddy and WAF in general is that you don’t do radical changes between deployments, e.g. if you change your ruleset you won’t change 100 regexes, mostly just a couple so it is still OK to keep the global cache that will keep alive a few remaining regexes until next deployment.

If we want to really be precise we should do something like ref counting which is much more problematic so I think this should be a good enough solution.

@@ -456,7 +457,12 @@ func (r *Rule) AddVariable(v variables.RuleVariable, key string, iscount bool) e
var re *regexp.Regexp
if len(key) > 2 && key[0] == '/' && key[len(key)-1] == '/' {
key = key[1 : len(key)-1]
re = regexp.MustCompile(key)

if vare, err := memoize.Do(key, func() (interface{}, error) { return regexp.Compile(key) }); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be worth extracting a function for the two usages

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean for the regex and binaryregex?

func makeDoer(cache *sync.Map, group *singleflight.Group) func(string, func() (interface{}, error)) (interface{}, error, bool) {
return func(key string, fn func() (interface{}, error)) (interface{}, error, bool) {
// Check cache
value, found := cache.Load(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, can combine two lines

Copy link
Contributor

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah we need to add a ci build with the new flag. In particular with tinygo

@jcchavezs
Copy link
Member Author

@anuraaga we do that already, see:

https://github.com/corazawaf/coraza/pull/836/files#diff-520109d035a196d29d0a86e9c4e6c98e98abf056141b84df68c0d48167b87b25R114-R126

As for tinygo, we already run tests for tinygo https://github.com/corazawaf/coraza/blob/main/.github/workflows/tinygo.yml#L49 and we don't have to enable memoize_regex there as it won't work without synchronizer anyways.

the same process and hence same regexes are being compiled
over and over.

Currently it is opt-in under the `memoize_regex` build tag
Copy link
Member

@M4tteoP M4tteoP Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: what about also adding a one-line description in the main readme, under https://github.com/corazawaf/coraza#build-tags?

@anuraaga
Copy link
Contributor

anuraaga commented Jul 6, 2023

Ah missed the magefile change.

But this doesn't need to apply to proxy-wasm? If it's only caddy that seems more narrow than I expected. Yeah I know I recommended sync.map but forgot about TinyGo, we should probably have a non-sync version for it to still have memorization with Envoy, where the memory penalty of multiple wafs is way higher than caddy

@jcchavezs
Copy link
Member Author

jcchavezs commented Jul 6, 2023 via email

@anuraaga
Copy link
Contributor

anuraaga commented Jul 6, 2023

BTW TinyGo does have an implementation of sync.map, which of course doesn't actually synchronize

https://github.com/tinygo-org/tinygo/blob/release/src/sync/map.go

So I think the current code should work but we should make sure in CI

@jcchavezs
Copy link
Member Author

I will check and also adds memoize for aho-corasick trees.

@@ -521,7 +527,11 @@ func (r *Rule) AddVariableNegation(v variables.RuleVariable, key string) error {
var re *regexp.Regexp
if len(key) > 2 && key[0] == '/' && key[len(key)-1] == '/' {
key = key[1 : len(key)-1]
re = regexp.MustCompile(key)
if vare, err := memoize.Do(key, func() (interface{}, error) { return regexp.Compile(key) }); err != nil {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never panic, you can return error here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, however this should be fixed in main, I am just reproducing the MustCompile behaviour.

@jptosso jptosso changed the title chore: adds memoize implementation for regexes. chore: adds memoize implementation for regexes and ahocorasick Aug 5, 2023
@jcchavezs
Copy link
Member Author

PTAL @anuraaga

@@ -106,6 +106,9 @@ have compatibility guarantees across minor versions - use with care.
the operator with `plugins.RegisterOperator` to reduce binary size / startup overhead.
* `coraza.rule.multiphase_valuation` - enables evaluation of rule variables in the phases that they are ready, not
only the phase the rule is defined for.
* `memoize_builders` - enables memoization of builders for regex and aho-corasick
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @anuraaga

@jcchavezs jcchavezs merged commit 4f30afe into main Aug 6, 2023
6 of 8 checks passed
@jcchavezs jcchavezs deleted the memoize_regex branch August 6, 2023 07:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants