@sameh-farouk sameh-farouk commented Nov 19, 2025

Closes: #237

Summary
This PR introduces a two-part cleanup strategy for Redis streams that prevents memory bloat from expired messages and supports long-term stability. It adds probabilistic, time-based trimming of active streams using XTRIM MINID and aligns the existing EXPIRE mechanism with client-side TTLs.

Details:
This change implements a two-part strategy where each mechanism has a distinct role:

  • XTRIM MINID for Active Streams (The Housekeeper)
    On every send, there is now a 10% chance of triggering an XTRIM MINID <now - 30_minutes> command.
    This efficiently removes any messages older than the maximum message TTL (30 minutes) from within an active stream.

  • EXPIRE for Inactive Streams (The Garbage Collector)
    The existing EXPIRE command is retained. Its purpose is to remove streams entirely for clients who have been disconnected for more than 30 minutes.

Together, XTRIM and EXPIRE cover both cases: active streams are trimmed in place, and inactive streams are removed entirely.
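The two mechanisms can be sketched as the list of Redis commands issued for a single send. This is a minimal illustration, not the PR's actual code: the helper names (`minid_threshold`, `entry_timestamp_ms`, `commands_for_send`), the `msg` field name, and the constant names are assumptions made for the example. Only the 30-minute values and the 10% probability come from the description above.

```python
import random

MAX_TTL_MS = 30 * 60 * 1000   # maximum message TTL (30 minutes), in milliseconds
QUEUE_EXPIRE_S = 30 * 60      # stream key expiry (30 minutes), in seconds
TRIM_PROBABILITY = 0.10       # chance of issuing XTRIM on any given send


def minid_threshold(now_ms: int) -> str:
    """Stream ID for "now - 30 minutes"; entries with lower IDs get trimmed."""
    return f"{now_ms - MAX_TTL_MS}-0"


def entry_timestamp_ms(entry_id: str) -> int:
    """Extract the millisecond timestamp from an ID shaped <timestamp_ms>-<seq>."""
    return int(entry_id.split("-", 1)[0])


def commands_for_send(key, payload, now_ms, rng=random):
    """Return (pipeline, cleanup) command tuples for one send (hypothetical helper).

    The pipeline always appends the message and refreshes the key expiry.
    The cleanup list holds an XTRIM MINID command roughly 10% of the time.
    """
    pipeline = [
        ("XADD", key, "*", "msg", payload),   # "msg" field name is an assumption
        ("EXPIRE", key, QUEUE_EXPIRE_S),
    ]
    cleanup = []
    if rng.random() < TRIM_PROBABILITY:
        cleanup.append(("XTRIM", key, "MINID", minid_threshold(now_ms)))
    return pipeline, cleanup
```

Passing `rng` explicitly keeps the probability check deterministic in tests: any object with a `random()` method, such as a seeded `random.Random`, can be supplied.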

Key Design Decisions & Rationale
The implementation rests on three design choices:

  • Why is XTRIM MINID so Efficient?
    This approach works because an auto-generated Redis Stream ID embeds the entry's creation timestamp (<timestamp_ms>-<seq>). XTRIM MINID compares against this embedded timestamp to delete old messages natively within Redis, without the application needing to read, deserialize, and inspect message envelopes. This avoids significant CPU and network overhead.

  • Why Probabilistic 10% Sampling?

    • vs. 100% (Every Send): Running XTRIM on every send would create excessive network and CPU overhead for marginal benefit. 10% provides an excellent balance, keeping streams clean without taxing the system.
    • vs. A Periodic Job: A background job that runs every 30 minutes would need to SCAN the entire Redis keyspace, which is inefficient and causes CPU spikes. The probabilistic "on-send" approach has a smooth performance profile and scales up and down naturally with stream activity: high traffic triggers more cleanups, while low traffic results in fewer.
  • Why Is XTRIM a Separate, Best-Effort Command?
    The XTRIM command is intentionally sent after the main XADD/EXPIRE pipeline. This is a critical safety measure. If XTRIM were in the main pipeline, a rare failure would cause the entire send operation to fail. By separating it, we ensure the non-critical cleanup task can never interfere with the critical path of message delivery.
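The separation of the critical path from the cleanup step can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `client` is a hypothetical object exposing `run_pipeline(commands)` and `run(command)`, and the function name, logger name, and returned status strings are invented for the example.

```python
import logging
import random

logger = logging.getLogger("relay.cleanup")  # hypothetical logger name

MAX_TTL_MS = 30 * 60 * 1000   # 30 minutes in milliseconds
QUEUE_EXPIRE_S = 30 * 60      # 30 minutes in seconds
TRIM_PROBABILITY = 0.10


def send_message(client, key, payload, now_ms, rng=random):
    """Send one message, then optionally trim the stream on a best-effort basis.

    `client` is a hypothetical interface with run_pipeline() and run().
    Returns "skipped", "trimmed", or "trim_failed" to describe the cleanup step.
    """
    # Critical path: any failure here propagates to the caller.
    client.run_pipeline([
        ("XADD", key, "*", "msg", payload),   # "msg" field name is an assumption
        ("EXPIRE", key, QUEUE_EXPIRE_S),
    ])

    # Cleanup runs only after the message has been accepted.
    if rng.random() >= TRIM_PROBABILITY:
        return "skipped"
    try:
        client.run(("XTRIM", key, "MINID", f"{now_ms - MAX_TTL_MS}-0"))
    except Exception as exc:
        # Non-critical: log the failure and move on, the send already succeeded.
        logger.warning("best-effort XTRIM failed for %s: %s", key, exc)
        return "trim_failed"
    return "trimmed"
```

Because the XTRIM call sits outside the pipeline and inside its own `try` block, a trimming error is logged and swallowed, while an XADD or EXPIRE error still surfaces to the caller.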

How the probabilistic approach was tested:
The probabilistic approach was validated with two targeted tests:

  • Statistical Correctness (Law of Large Numbers):
    A test script was run to simulate the 10% probability check over a small sample (100 runs) and a large sample (100,000 runs).
    Result: The large sample produced a hit rate average of 9.89%, confirming that over the high volume of messages a relay handles, the cleanup rate reliably converges on the configured 10%.

  • Activity-Based Scaling:
    A second test simulated a mix of high-activity streams (70 total messages) and low-activity streams (30 total messages).
    Result: The high-activity group, responsible for 70% of the messages, received ~72% of the XTRIM cleanup operations.

Conclusion: The tests indicate that the probabilistic model directs cleanup effort to streams roughly in proportion to their traffic, so the busiest streams are trimmed most often, without any state tracking.
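Both checks can be approximated with a short simulation. The sketch below is not the test script from the PR: the function names are hypothetical, and the run and message counts are parameters so that any sample size can be tried.

```python
import random

TRIM_PROBABILITY = 0.10  # same 10% chance used on each send


def hit_rate(runs, rng):
    """Fraction of simulated sends that would trigger an XTRIM."""
    hits = sum(1 for _ in range(runs) if rng.random() < TRIM_PROBABILITY)
    return hits / runs


def trim_share(high_msgs, low_msgs, rng):
    """Fraction of all simulated trims that land on the high-activity group."""
    high = sum(1 for _ in range(high_msgs) if rng.random() < TRIM_PROBABILITY)
    low = sum(1 for _ in range(low_msgs) if rng.random() < TRIM_PROBABILITY)
    total = high + low
    # With very small samples there may be no trims at all.
    return high / total if total else 0.0
```

For example, `hit_rate(100, random.Random(1))` and `hit_rate(100_000, random.Random(1))` compare a small and a large sample, and `trim_share(70, 30, random.Random(1))` mirrors the 70/30 message split. Small samples vary noticeably from run to run, so the share is only expected to settle near 70% as the message counts grow.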

What's changed:

  • Introduce probabilistic, time-based stream trimming using XTRIM MINID, which efficiently cleans expired messages on busy streams.
  • Reduce QUEUE_EXPIRE from 1 hour to 30 minutes. An idle stream is now deleted around the same time its last possible message would have expired anyway.
  • Add new unit tests to validate the probabilistic approach.

- Implemented 10% probabilistic XTRIM on message sends to remove expired messages older than 30 minutes
- Reduced QUEUE_EXPIRE from 1 hour to 30 minutes to align with MAX_TTL
- Added tests verifying probability distribution and activity-based scaling
@sameh-farouk sameh-farouk force-pushed the development-implements-active-streams-housekeeper branch from 7aaf3aa to aed8fbd Compare November 19, 2025 02:16
@sameh-farouk sameh-farouk marked this pull request as ready for review November 19, 2025 12:02
Development

Successfully merging this pull request may close these issues.

Efficient Expired Message Cleanup
