-
Notifications
You must be signed in to change notification settings - Fork 336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fault-tolerance for cache system errors #577
base: main
Are you sure you want to change the base?
Conversation
- Catch and handle all errors once per request. - Remove the `rescuing` blocks from the store proxies; rescuing per-method (read, write, increment) is bad because (a) it may result in undefined behavior, and (b) it will trigger repeated connection timeouts if your cache is down, e.g. N * M * timeout latency where N is the number of Rack::Attack metrics and M is the cache requests per metric. - Add `Rack::Attack.ignored_errors` config. This defaults to Dalli::DalliError and Redis::BaseError. - Add `Rack::Attack.failure_cooldown` config. This temporarily disables Rack::Attack after an error occurs (including ignored errors), to prevent cache connection latency. The default is 60 seconds. - Add `Rack::Attack.error_handler` which takes a Proc for custom error handling. It's probably not needed but there may be esoteric use cases for it. You can also use the shortcut symbols :block, :throttle, and :allow to respond to errors using those. - Add `Rack::Attack.calling?` method which uses Thread.current (or RequestStore, if available) to indicate that Rack::Attack code is executing. The reason for this is to add custom error handlers in the Rails Cache, i.e. "raise the error if it occurred while Rack::Attack was executing, so that Rack::Attack and handle it." Refer to readme. - Add "Fault Tolerance & Error Handling" section to Readme which includes all of the above.
a141aa1
to
ae9d7fd
Compare
@grzuy can you please review this PR? |
FYI I've been running this in production now for several weeks with no issue. |
@grzuy any consideration to merge this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this PR can be merged. Just adding a few points and a review in case helpful
- Improvements to readme
@timdiggins I've updated the PR in response to your comments. Thanks! |
@grzuy any chance to merge this? I've been using in Production for 6 months. |
@grzuy please review. |
I love the idea of this PR. I just have one concern about it being enabled by default. If an attacker is able to influence a system to enter the bypass mode, they might be able to compromise a critical security feature in some cases. Consider if someone uses Personally, I'd classify Of course, I could be wrong, and maybe there are mitigating factors here I don't understand. I'm happy to be corrected. Thanks either way! |
@justinhoward Im fine to disable this behavior by default on this PR. If i do that can someone please merge? |
I am very interested in this too, I'm running into problem cases similar to described in original issue -- if your store is misbehaving, it can wind up bringing your whole app down. Which is pretty disastrous. Ability to configure to in some cases just ignore rack-attack failure and proceed with app processing seems crucial. |
Actually, now trying to trace what's going on in my app, I am realizing that when using Memcached/Dalli, similar behavior may actually already be built into Dalli? eg But I think you really do want to suspend rack-attack at this higher level notwithstanding... I realize I am not sure I understand what's going on. |
@grzuy please review. |
@grzuy any chance of getting this into master? I just had my production app nearly taken out by a bad memcached! |
Hi @johnnyshields, I think that the problem you are solving is legit, thanks for your contribution 🙌 As you said, this PR adds several features, which IMO makes it hard to be accepted and merged by maintainers. Maybe if we keep this PR as simple as possible, and separate the other features/improvements (like What comes to my mind is something like:
Maybe this could be done by keeping the existing catch statements and re-raising a new custom tailored exception rescue Rack::Attack::CacheError => e
Rack::Attack.exception_handler.call(e)
@app.call(env)
end My question is: if we do that, will it have solved the issue you experienced and the ones described in #511? What else would be missing? Does the solution has any drawback in terms of behavior, security, or any other aspect? |
@santib I am happy to try to break this into a series of smaller PRs if I can get your commitment to review and merge in a timely manner. FWIW I've served billions of requests at TableCheck in the last ~1 year using this code, with no issues. |
@henryaj welcome to the club! |
That would be possible, but since the internal Dalli/RedisProxy classes would be now raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference. TBH I think is desirable to have the original database-driver specific error messages, rather than catching and raising a new one. By removing all the silly |
Hey, just in case, no, I'm not a maintainer.
Agree, the difference is mostly for the maintainance of this gem, so that each store_proxy continues to encapsulate the specifics of each store. Leaving here an example of what I'm referring to: main...santib:rack-attack:system-hardening-refactor Apart from the introduced
|
Yep I understand what you are proposing. IMHO that's actually more complex than my PR. My preference is to keep things as simple as possible and just use the native errors--less code, less bloat, no new not-yet-seen error types, etc. That being said I will defer to the maintainers whether they prefer the bloat. If a maintainer can chime in here I'd appreciate it. |
@johnnyshields if you fork rack-attack, I'll use your fork! It does not appear to me that current maintainers of rack-attack are interested in improvements. |
@jrochkind it will be second gem I've forked this week 🤣 OK give me a few days I'll do it over the weekend. |
I will note that over on #578 one of the maintainers has been engaging with me, on that PR I submitted last year. He's had some great feedback, and some progress has been made just in this past week. I'm not sure I'd give up on them just yet! |
@jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer.... @santib , if you give detailed feedback and requests for changes that sound like they are coming from a maintainer, and people invest their time in discussing and making those changes thinking if they meet your approval they will get their PR merged by you... they may be dismayed when they find out you actually have no influence over that process... |
@jamiemccarthy I'm sorry if I left the wrong impression 🙏 @jrochkind Don't worry, I'll make sure to always clarify I'm not a maintainer from now on 👍 What I do think is that if we make PRs simpler they'll eventually get reviewed by maintainers. But of course you don't have to trust me. |
Oops! My misunderstanding. Well, I'm not dismayed — @santib I do think your feedback has been helpful! In any case the repo has had 8 PRs merged in October, so it doesn't seem like it's abandoned. |
Thank you @johnnyshields for your work and patience on this PR. And thank you @timdiggins, @justinhoward, @jrochkind, @henryaj, @santib and @jamiemccarthy for all the helpful comments and feedback. Agree with @justinhoward and @santib comments in that we would want to narrow it down to the smallest changeset possible that address yours issue in production, being an opt-in setting. I would prefer, given the considerable amount of rack-attack users and maturity of the gem, that mostly nothing changes for existing users that don't opt-in. Thanks again. |
@grzuy Appreciate the comment. Let me try to break this into a series of smaller PRs. One thing that will be inevitable is that we need to raise an error from the Redis/DalliProxy classes to the parent level. Currently this is being swallowed inside the proxy classes themselves. Two options here:
Do you have a strong opinion on which way we should go? |
Just for what is worth, he is a mantainer now #637. |
Fixes #511.
This PR adds several features designed to increase fault-tolerance in case the cache infrastructure (Redis, Memcached, etc.) fails. This introduces a few new configs; the intention here is to make a "sensible default" that will work for 99% of real-world Rails apps, while adding customizability for the DIY crowd.
rescuing
blocks from the store proxies; rescuing per-method (read, write, increment) is bad because (a) it may result in undefined behavior, and (b) it will trigger repeated connection timeouts if your cache is down, e.g. N * M * timeout latency where N is the number of Rack::Attack metrics and M is the cache requests per metric.Rack::Attack.allowed_errors
config. This defaults to Dalli::DalliError and Redis::BaseError.Rack::Attack.failure_cooldown
config. This temporarily disables Rack::Attack after an error occurs (including allowed errors), to prevent cache connection latency. The default is 60 seconds.Rack::Attack.error_handler
which takes a Proc for custom error handling. It's probably not needed but there may be esoteric use cases for it. You can also use the shortcut symbols:block
,:throttle
, and:allow
to respond to errors using those.Rack::Attack.calling?
method which uses Thread.current (or RequestStore, if available) to indicate that Rack::Attack code is executing. The reason for this is so Rails Cache error hander can raise the error for Rack::Attack to handle (i.e. if the error occurred while Rack::Attack was executing.) Refer to readme.Rack::Attack
class; code is now split into more methods.In addition, I have added RSpec Mocks as a development dependency. Specifically for tests related to raising/handling errors, I find RSpec Mocks a heck of a lot easier to work with than MiniTest, however, if there are any MiniTest gurus out there who can suggest a pure-MiniTest way for my tests, I'd be happy to change it. Note that RSpec Mocks is added by itself without requiring all of RSpec.
Lastly, I'd like to give a special shoutout to Amazon MemoryDB for Redis for causing the failure that inspired me to write this PR.