fix(limit-req): Make Redis path atomic via EVAL + use hash key with TTL #12605

Open
falvaradorodriguez wants to merge 12 commits into apache:master from falvaradorodriguez:fix/limit-req-plugin-redis-atomic

@falvaradorodriguez falvaradorodriguez commented Sep 9, 2025

Description

The current limit-req Redis implementation uses two separate keys (excess and last) and updates them with multiple GET/SET operations.

  • Under concurrent load, this leads to race conditions:
  1. Several workers may read stale values in parallel and overwrite each other.
  2. As a result, the plugin allows more requests than expected, effectively bypassing the intended rate limit.
  • On Redis Cluster, the current approach is also problematic: atomic EVAL cannot be executed across two different keys located on different slots.

Solution

  • Store both values (excess and last) under a single Redis hash key, so the state is managed as one unit.

  • Use a single EVAL script that performs read → compute → write atomically inside Redis, removing race conditions. This approach is consistent with how the limit-count plugin already works.

  • Add a TTL to the key to avoid buildup of stale state.

  • Preserve existing semantics: the first request with no prior state does not consume tokens.
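
The read → compute → write step described above can be sketched in plain Lua. This is a minimal illustration, not the script from this PR: the `store` table is a hypothetical in-memory stand-in for the Redis hash and its TTL, `limit_req_step` is an illustrative name, and the excess formula is assumed to follow the leaky-bucket calculation used by lua-resty-limit-traffic, with `rate` and `burst` scaled by 1000 and `now` in milliseconds.

```lua
-- Hypothetical in-memory stand-in for Redis:
-- key -> { excess = ..., last = ..., expires_at = ... }.
-- A real script would use redis.call with HMGET / HMSET / EXPIRE instead.
local store = {}

-- One limiter step, performed as a single unit.
-- rate, burst: requests per second scaled by 1000 (assumption).
-- now: current time in milliseconds. ttl_seconds: lifetime of the state.
-- Returns allowed (boolean) and the excess computed for this request.
local function limit_req_step(key, rate, burst, now, ttl_seconds)
    local state = store[key]
    if state and state.expires_at <= now then
        state = nil  -- emulate key expiry
    end

    local excess
    if state then
        local elapsed = math.abs(now - state.last)
        excess = math.max(state.excess - rate * elapsed / 1000 + 1000, 0)
        if excess > burst then
            return false, excess  -- rejected: stored state is left untouched
        end
    else
        excess = 0  -- first request with no prior state costs nothing
    end

    -- Both fields are written together, mirroring the single hash key.
    store[key] = {
        excess = excess,
        last = now,
        expires_at = now + ttl_seconds * 1000,
    }
    return true, excess
end
```

Because the whole step runs inside Redis when shipped as one EVAL script, no other worker can read the state between the read and the write, which is what removes the race described in the Description.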

Which issue(s) this PR fixes:

Fixes #12592

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@falvaradorodriguez falvaradorodriguez force-pushed the fix/limit-req-plugin-redis-atomic branch from 25d03d5 to 3530220 Compare September 9, 2025 15:23
- Switch Redis storage to a single hash key (cluster compatibility)
- Perform read/compute/write atomically with EVAL
- Keep first-hit behavior (no cost on missing state)
- Add EX-based TTL to avoid key buildup
@falvaradorodriguez falvaradorodriguez force-pushed the fix/limit-req-plugin-redis-atomic branch from 3530220 to d60f099 Compare September 9, 2025 15:27
@falvaradorodriguez falvaradorodriguez changed the title from "fix: Make Redis path atomic via EVAL + use hash key with TTL" to "fix (limit-req): Make Redis path atomic via EVAL + use hash key with TTL" Sep 9, 2025
@falvaradorodriguez falvaradorodriguez changed the title from "fix (limit-req): Make Redis path atomic via EVAL + use hash key with TTL" to "fix(limit-req): Make Redis path atomic via EVAL + use hash key with TTL" Sep 9, 2025
@falvaradorodriguez falvaradorodriguez marked this pull request as ready for review September 10, 2025 09:04
@dosubot dosubot bot added the size:M and bug labels Sep 10, 2025
@Baoyuantop
Contributor

Hi @falvaradorodriguez, we need to add a test case for this fix.

@Baoyuantop Baoyuantop moved this to 👀 In review in ⚡️ Apache APISIX Roadmap Sep 10, 2025
@dosubot dosubot bot added the size:L label and removed the size:M label Sep 11, 2025
@Baoyuantop
Contributor

Hi @falvaradorodriguez, there are failing CI checks that need fixing.

@falvaradorodriguez falvaradorodriguez force-pushed the fix/limit-req-plugin-redis-atomic branch 3 times, most recently from 40a7f03 to af4d670 Compare September 22, 2025 12:30
@falvaradorodriguez falvaradorodriguez force-pushed the fix/limit-req-plugin-redis-atomic branch from af4d670 to 6399925 Compare September 22, 2025 13:05
@falvaradorodriguez
Author

falvaradorodriguez commented Sep 22, 2025

> Hi @falvaradorodriguez, there are failed CI that need fixing.

Hi @Baoyuantop!

The failures are not related to the current changes. The tests for the modified plugin seem to pass.

Could you please review?

Thanks

[15:23:12] t/plugin/limit-req-redis-cluster.t ......... ok    15865 ms ( 0.01 usr  0.01 sys +  0.91 cusr  2.03 csys =  2.96 CPU)
[15:23:29] t/plugin/limit-req-redis.t ................. ok    16682 ms ( 0.02 usr  0.00 sys +  0.95 cusr  2.23 csys =  3.20 CPU)
2025/09/22 15:23:37 Processed 0 requests
[15:23:41] t/plugin/limit-req.t ....................... ok    11856 ms ( 0.02 usr  0.00 sys +  0.83 cusr  1.78 csys =  2.63 CPU)
[15:23:46] t/plugin/limit-req2.t ...................... ok     5010 ms ( 0.01 usr  0.00 sys +  0.81 cusr  0.42 csys =  1.24 CPU)
[15:23:47] t/plugin/limit-req3.t ...................... ok     1290 ms ( 0.00 usr  0.00 sys +  0.39 cusr  0.22 csys =  0.61 CPU)


if commit == 1 then
    redis.call("HMSET", state_key, "excess", new_excess, "last", now)
    redis.call("EXPIRE", state_key, 60)

Contributor

Can the TTL be derived dynamically from the rate-limiting window, or be made user-configurable?

Author

@falvaradorodriguez falvaradorodriguez Sep 24, 2025

In the limit-req plugin, there is no time window like the one in the limit-count plugin.

This TTL exists only to keep dead keys from lingering in Redis. Currently, if a consumer stops making requests, their key remains in Redis until something outside APISIX deletes it.

It could be made configurable, but in my opinion that would not add much value to this plugin. It would also be an extra feature unrelated to this fix.

Since limit-req always works with a per-second rate, a 1-minute margin in Redis seems acceptable to me. If the consumer makes a request after more than 1 minute, it is handled the same way as a first request.

luarx commented Oct 8, 2025

When will this PR be merged? I am having the same issue that this PR fixes 🙏

@Baoyuantop
Contributor

I will promptly urge other community maintainers to review.

@Baoyuantop Baoyuantop added the wait for update label Dec 24, 2025
Precompute the Lua script SHA1 locally and always execute scripts via
EVALSHA to avoid repeated SCRIPT LOAD operations.

Add a robust NOSCRIPT fallback to EVAL to ensure compatibility with both
resty.redis and resty.rediscluster, especially in Redis Cluster setups
where scripts are cached per node.

This improves performance and makes script execution resilient to Redis
node restarts, failovers, and resharding.
@falvaradorodriguez
Author

> Hi @falvaradorodriguez, it appears there are test failures in CI that need to be fixed.

There was a node-management issue with Redis Cluster. It seems that loading the script does not guarantee that it will be available for the next EVALSHA call. I have changed the fallback to EVAL to ensure that the script is always executed. Once it has been executed with EVAL, it can be invoked by its SHA1.
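
The EVALSHA-first, EVAL-on-NOSCRIPT flow described here could look roughly like the sketch below. It assumes a client object `red` that exposes `evalsha` and `eval` the way resty.redis does and that reports command errors as a falsy result plus an error string. `run_script` is an illustrative name, and the body of `is_noscript_error` is a guess at what such a helper might check, not the code from this PR.

```lua
-- Returns true when a Redis error string reports a missing cached script.
local function is_noscript_error(err)
    return type(err) == "string" and err:find("NOSCRIPT", 1, true) ~= nil
end

-- Tries the cached script by SHA1 first; falls back to EVAL on NOSCRIPT.
-- red: client assumed to expose evalsha/eval like resty.redis.
-- script: Lua source of the script; sha1: its precomputed hex digest.
-- key: the single state key; extra arguments are passed through as ARGV.
local function run_script(red, script, sha1, key, ...)
    local res, err = red:evalsha(sha1, 1, key, ...)
    if not res and is_noscript_error(err) then
        -- EVAL runs the script and also caches it on the node that served it,
        -- so later EVALSHA calls to that node can succeed.
        res, err = red:eval(script, 1, key, ...)
    end
    return res, err
end
```

Any error other than NOSCRIPT is returned to the caller unchanged, so connection failures are not masked by the fallback.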

@Baoyuantop Baoyuantop added the awaiting review label and removed the wait for update and user responded labels Dec 25, 2025
@Baoyuantop
Contributor

Hi @falvaradorodriguez, could you please fix the lint bug?

@Baoyuantop Baoyuantop added the wait for update label and removed the awaiting review label Dec 26, 2025

if commit == 1 then
    redis.call("HMSET", state_key, "excess", new_excess, "last", now)
    local ttl = math.ceil(burst / rate) + 1

Contributor

nice!
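
As a worked example of the TTL expression in the excerpt above: if `rate` and `burst` are both scaled by 1000 (an assumption taken from the later review comment on units), `burst / rate` reads as roughly the number of seconds a full bucket needs to drain, and the `+ 1` adds a second of margin. The helper name `state_ttl` is hypothetical.

```lua
-- rate and burst are assumed to be requests per second scaled by 1000.
local function state_ttl(rate, burst)
    return math.ceil(burst / rate) + 1
end

-- 10 req/s with a burst of 25: 25000 / 10000 = 2.5, rounded up to 3, plus 1.
print(state_ttl(10 * 1000, 25 * 1000))  -- prints 4

-- With no burst allowance the state only needs to live for the margin.
print(state_ttl(10 * 1000, 0))          -- prints 1
```

Since the scale factor cancels in the division, the result is the same whether or not the inputs are scaled, as long as both use the same unit.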


local s
if type(err) == "table" then
    s = tostring(err[1] or err.err or err.message or err.msg or err)

Contributor

Last time I checked, err was always a string. Did you observe something different that led you to add the err.err, err.message, and err.msg fallbacks?

Author

@falvaradorodriguez falvaradorodriguez Jan 28, 2026

Thanks for the note. In resty.redis the error is indeed typically returned as a plain string.

However, since this code path is meant to work with both resty.redis and resty.rediscluster, I added a small defensive handling for the case where errors may be propagated as a structured table (e.g. containing err/message fields) instead of a raw string.

This doesn’t change the behavior for the common case, but makes the NOSCRIPT detection more robust across Redis standalone and Redis Cluster setups, especially during redirects/failovers where error formats may differ.

Author

With the changes you mentioned here, what I wrote above no longer makes sense. I have replicated the error handling you use in the limit-conn plugin.

else
excess = 0
-- If the script isn't cached on this Redis node, fall back to EVAL.
if not res and is_noscript_error(err) then

Contributor

I observed that EVALSHA was very unreliable with Redis Cluster.
#12872 (comment)

So I decided not to use EVALSHA when the policy is set to rediscluster. You can refer to the PR I just linked to see how I did it.

Let me know if you have any better ideas, though.

Author

Makes sense, and it is also good practice to keep the logic of the plugins aligned. I have tried to replicate the functionality you mention here: 05657f6

Contributor

@shreemaan-abhishek shreemaan-abhishek left a comment

Please resolve the conflicts too.

@Baoyuantop Baoyuantop removed the wait for update label Feb 24, 2026
@Baoyuantop Baoyuantop requested a review from Copilot March 11, 2026 09:16
Copilot AI left a comment

Pull request overview

This PR fixes race conditions in the limit-req Redis/Redis-Cluster policies by moving the limiter state into a single Redis hash key and updating it atomically via a single Lua EVAL script (with EVALSHA fast-path for standalone Redis), aligning behavior with the intended rate-limit semantics under concurrency.

Changes:

  • Replaces multi-key GET/SET updates with an atomic Redis Lua script operating on one hash key (excess + last) plus TTL.
  • Enables EVALSHA optimization for the standalone Redis policy; uses EVAL for Redis Cluster for reliability.
  • Adds test cases for Redis and Redis Cluster to validate hash structure usage, TTL presence, and basic limiting behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| apisix/plugins/limit-req/util.lua | Implements atomic Redis script (hash-based state + TTL) with EVALSHA fallback logic. |
| apisix/plugins/limit-req/limit-req-redis.lua | Enables use_evalsha for standalone Redis limiter instances. |
| apisix/plugins/limit-req/limit-req-redis-cluster.lua | Disables use_evalsha for cluster limiter instances (uses EVAL). |
| t/plugin/limit-req-redis.t | Adds tests for hash-key state + TTL and a rapid-request rejection case for Redis policy. |
| t/plugin/limit-req-redis-cluster.t | Adds analogous tests for Redis Cluster policy. |

Comment on lines +27 to +29
local rate = tonumber(ARGV[1]) -- req/s
local now = tonumber(ARGV[2]) -- ms
local burst = tonumber(ARGV[3]) -- req/s

Copilot AI Mar 11, 2026

The Lua script comments label rate/burst as "req/s", but in this limiter implementation they are passed around in the same scaled units used by the original algorithm (rate = configured_rate * 1000, burst = configured_burst * 1000). Updating these comments to reflect the actual units would reduce the risk of future logic changes introducing subtle math/TTL bugs.

Suggested change
- local rate = tonumber(ARGV[1]) -- req/s
- local now = tonumber(ARGV[2]) -- ms
- local burst = tonumber(ARGV[3]) -- req/s
+ local rate = tonumber(ARGV[1]) -- scaled request rate (configured_rate * 1000)
+ local now = tonumber(ARGV[2]) -- ms
+ local burst = tonumber(ARGV[3]) -- scaled burst (configured_burst * 1000)

Comment on lines +699 to +700
local vals = red:hmget("limit_req:{test_key}:state", "excess", "last")
if vals[1] and vals[2] then

Copilot AI Mar 11, 2026

resty.redis:hmget returns (vals, err). This test only captures the first return value and then unconditionally indexes vals[1]/vals[2], which will throw a Lua runtime error if hmget fails and returns nil. Capture err and guard for not vals (and ideally print the error) to make the test robust and failures diagnosable.

Suggested change
- local vals = red:hmget("limit_req:{test_key}:state", "excess", "last")
- if vals[1] and vals[2] then
+ local vals, err = red:hmget("limit_req:{test_key}:state", "excess", "last")
+ if not vals then
+     ngx.say("failed to hmget: ", err)
+ elseif vals[1] and vals[2] then

Member

@moonming moonming left a comment

Hi @falvaradorodriguez, thank you for making the Redis limit-req path atomic via EVAL!

Using a Lua script with EVAL to ensure atomicity of the rate limiting operation is the correct approach — the current non-atomic multi-command flow can have race conditions under high concurrency. With 12 reviews, this has been thoroughly discussed.

To confirm readiness:

  1. Are all 12 review comments addressed?
  2. Has the EVAL script been tested under concurrent load to verify it resolves the race condition?
  3. The hash key with TTL approach — does it handle key expiration correctly for sliding windows?

This is an important correctness fix. Let's get it finalized! Thank you.

@Baoyuantop
Contributor

Hi @falvaradorodriguez, I see no problems with the code here. Could you please fix the failing test?

Labels

awaiting review, bug, size:L

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

bug: limit-req plugin does not work correctly when configuring redis or redis-cluster

7 participants