Stoplight currently uses integer Unix timestamps (1-second granularity) for bucketing time-series data in the Memory data store and as the ZSET score in the Redis data store. This creates some limitations:
- Precision loss in short windows: a 10-second window divides into only 10 one-second buckets, so each bucket represents 10% of the window. This coarse granularity makes error rate calculations imprecise - a single bucket aging out can swing the error rate by 10%.
- No sub-second windows: High-throughput systems processing thousands of requests per second cannot use windows shorter than 1 second. For services handling 10,000 req/s, a 1-second window captures 10,000 data points, while a 100ms window would capture only 1,000 - a more appropriate sample size for fast failure detection.
For high-throughput systems, a 1-second window is just too slow - by the time the circuit opens, 10,000 failed requests have already been attempted.
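To make the granularity limitation concrete, here is a minimal sketch (not Stoplight's actual internals; the helper names are made up) of integer-second bucketing versus a fractional bucket width:

```ruby
# With integer Unix timestamps, every bucket is exactly 1 second wide:
# all requests that arrive within the same second land in the same bucket.
def integer_bucket(now = Time.now)
  now.to_i
end

# With a configurable bucket width (in seconds, possibly fractional), the
# bucket key is the timestamp truncated to a multiple of that width.
def fractional_bucket(bucket_width, now = Time.now)
  ts = now.to_f                        # float seconds, sub-second precision
  (ts / bucket_width).floor * bucket_width
end

# A 10-second window with 1-second buckets has only 10 buckets (10% of the
# window each); the same window with 100ms buckets has 100 buckets (1% each),
# so one bucket aging out barely moves the error rate.
integer_bucket           # => e.g. 1718000000
fractional_bucket(0.1)   # => e.g. 1718000000.3
```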
Value Proposition
- React to cascading failures in 10-100ms instead of 1-10s
- Protect in-process service boundaries the same way as network boundaries
- Support high-throughput services
Open Questions
- Should bucket granularity be fixed, explicitly configurable, or automatically derived from the window size (see the auto-scaling sketch after this list)?
- At what request volume does synchronizing metrics through Redis stop being worth it? Each light could make its decision quickly from local data alone, though keeping instances' open/close decisions synchronized still seems valuable.
- How do users specify sub-second windows? `window_size: 0.5` for 500ms? (A hypothetical usage sketch follows this list.)
- Should we recommend the Memory data store for high-throughput circuits and Redis only for coordinated low-to-medium traffic?
- Can Lua scripting in Redis handle that many writes per second efficiently, or do we need batching strategies? (A pipelining sketch follows this list.)
- How do sub-second windows impact `cool_off_time` (currently second-granularity)?
- What happens when clock drift between instances exceeds bucket granularity?
- Should sub-second precision require monotonic clocks? How does this affect Redis synchronization? (A short clock sketch follows this list.)
- Do we even need to support this?
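On the granularity question, one possible auto-scaling rule (an assumption, not a decided design) is to derive the bucket width from the window so every window gets a similar number of buckets:

```ruby
# Derive the bucket width from the window size so any window has roughly
# 100 buckets, with a 1ms floor. Purely illustrative; the constants and the
# helper name are made up for this sketch.
def auto_bucket_width(window_size)
  [window_size / 100.0, 0.001].max
end

auto_bucket_width(10)    # => 0.1   (100ms buckets for a 10s window)
auto_bucket_width(1)     # => 0.01  (10ms buckets for a 1s window)
auto_bucket_width(0.05)  # => 0.001 (floor kicks in for a 50ms window)
```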
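For the window-specification question, a hypothetical usage sketch, assuming `window_size` (and `cool_off_time`) accept a Float number of seconds; the constructor signature and the service names here are assumptions, not the current API:

```ruby
# Hypothetical only: a 500ms sliding window with a 200ms cool-off.
light = Stoplight("payments-api", window_size: 0.5, cool_off_time: 0.2)

light.run { PaymentGateway.charge(order) } # PaymentGateway/order are placeholders
```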
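On the Redis write-volume question, ZSET scores are already double-precision floats, so sub-second timestamps fit without a schema change. The sketch below uses redis-rb directly with made-up key names (not Stoplight's actual Lua scripts) and shows float scores plus pipelining as one possible batching approach:

```ruby
require "redis"
require "securerandom"

redis  = Redis.new
key    = "stoplight:failures:payments-api"   # made-up key layout
window = 0.5                                 # 500ms sliding window

# Record a batch of failures with float-second scores in a single round
# trip, then trim everything that has fallen out of the window.
now = Time.now.to_f
redis.pipelined do |pipe|
  3.times { pipe.zadd(key, now, "#{now}:#{SecureRandom.uuid}") }
  pipe.zremrangebyscore(key, 0, now - window)
end

failures_in_window = redis.zcard(key) # failures recorded in the last 500ms
```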
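On the clock question, a short sketch of the trade-off; the split described in the comments is an assumption, not a decided design:

```ruby
# A monotonic clock never jumps backwards, so it gives stable local bucket
# boundaries, but its values are not comparable across processes and cannot
# serve as a shared Redis ZSET score. The wall clock is shared but can jump
# under NTP adjustments.
monotonic = Process.clock_gettime(Process::CLOCK_MONOTONIC) # drift-free, local only
wall      = Process.clock_gettime(Process::CLOCK_REALTIME)  # shared meaning, can jump

# One option (assumption): measure local intervals monotonically, publish
# wall-clock scores to Redis, and tolerate roughly one bucket of skew.
```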