
perf: optimize Ring.Get() when a large number of instances are in a state for which the extended replication set is requested#884

Merged
pracucci merged 7 commits into main from fix-ring-performance-on-1-zone-leaving on Jan 28, 2026
@pracucci pracucci commented Jan 26, 2026

What this PR does:

Ring.Get() is very inefficient when a large number of instances are in a state for which the extended replication set is requested. This is not a new issue: @56quarters attempted an optimization in #672. Unfortunately, the issue is not fully fixed, and it becomes visible as the number of instances grows.

In this PR I propose an optimization based on the fact that looking up a small map is slower than searching for the corresponding item in a slice (maps become faster as the number of unique items increases, but for a few items slices are simply faster).
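
The slice-versus-map trade-off can be checked with a small standalone micro-benchmark. The sketch below is not code from this PR: `containsSlice`, `containsMap`, `sink`, and the instance IDs are hypothetical names chosen for illustration, and the relative timings depend on the Go version, the hardware, and the number of items.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from optimizing the lookups away.
var sink bool

// containsSlice reports whether target is in items using a linear scan.
func containsSlice(items []string, target string) bool {
	for _, item := range items {
		if item == target {
			return true
		}
	}
	return false
}

// containsMap reports whether target is a key of items.
func containsMap(items map[string]struct{}, target string) bool {
	_, ok := items[target]
	return ok
}

func main() {
	// Hypothetical instance IDs: a set with only a few entries.
	ids := []string{"instance-1", "instance-2", "instance-3"}
	set := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}

	// testing.Benchmark runs a benchmark function outside of "go test".
	sliceRes := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = containsSlice(ids, "instance-3")
		}
	})
	mapRes := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = containsMap(set, "instance-3")
		}
	})

	fmt.Println("slice:", sliceRes)
	fmt.Println("map:  ", mapRes)
}
```

Increasing the number of IDs should eventually tip the comparison in favor of the map, which is why a hybrid approach is attractive.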

The gain from this optimization is not huge, but I think it's still a step forward. Ring.Get() is still very slow when a large number of instances are in a state for which the extended replication set is requested, compared to the case where all instances are ACTIVE.

goos: darwin
goarch: arm64
pkg: github.com/grafana/dskit/ring
cpu: Apple M3 Pro
                                                         │ new-before.txt │            new-after.txt            │
                                                         │     sec/op     │    sec/op     vs base               │
Ring_Get/with_zone_awareness-11                             1008.1n ± 19%   553.1n ±  8%  -45.13% (p=0.002 n=6)
Ring_Get/one_excluded_zone-11                                624.6n ±  1%   393.0n ± 12%  -37.08% (p=0.002 n=6)
Ring_Get/without_zone_awareness-11                           347.1n ±  4%   334.5n ±  1%   -3.62% (p=0.002 n=6)
Ring_Get/without_zone_awareness,_not_enough_instances-11     268.5n ±  1%   256.2n ±  1%   -4.56% (p=0.002 n=6)
Ring_Get_OneZoneLeaving/one_zone_leaving_=_true-11          3739.3µ ±  5%   903.0µ ±  2%  -75.85% (p=0.002 n=6)
Ring_Get_OneZoneLeaving/one_zone_leaving_=_false-11          1.905µ ±  6%   1.169µ ±  2%  -38.62% (p=0.002 n=6)
geomean                                                      2.734µ         1.643µ        -39.91%

                                                         │ new-before.txt │             new-after.txt             │
                                                         │      B/op      │     B/op      vs base                 │
Ring_Get/with_zone_awareness-11                              0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get/one_excluded_zone-11                                0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get/without_zone_awareness-11                           0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get/without_zone_awareness,_not_enough_instances-11     0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get_OneZoneLeaving/one_zone_leaving_=_true-11         109.6Ki ± 0%     124.3Ki ± 0%  +13.45% (p=0.002 n=6)
Ring_Get_OneZoneLeaving/one_zone_leaving_=_false-11          0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=6) ¹
geomean                                                                 ²                  +2.13%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                                         │ new-before.txt │            new-after.txt            │
                                                         │   allocs/op    │ allocs/op   vs base                 │
Ring_Get/with_zone_awareness-11                              0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get/one_excluded_zone-11                                0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get/without_zone_awareness-11                           0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get/without_zone_awareness,_not_enough_instances-11     0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=6) ¹
Ring_Get_OneZoneLeaving/one_zone_leaving_=_true-11           19.00 ± 0%     27.00 ± 0%  +42.11% (p=0.002 n=6)
Ring_Get_OneZoneLeaving/one_zone_leaving_=_false-11          0.000 ± 0%     0.000 ± 0%        ~ (p=1.000 n=6) ¹
geomean                                                                 ²                +6.03%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Which issue(s) this PR fixes:

N/A

Checklist

  • Tests updated

Note

Improves performance of Ring.Get() when many instances extend the replica set, especially under zone-awareness.

  • Replaces distinctHosts slice/map usage with a hybrid stringSet (util.go) to reduce lookups/allocations; adds comprehensive tests
  • Switches per-zone tracking from maps to slices indexed by zone index (totalHostsPerZone, examinedHostsPerZone, foundHostsPerZone); updates canStopLooking() signature accordingly
  • Adds explicit zoneIndex resolution with validation; panics on misuse when zone-awareness is disabled
  • Adjusts main selection loop to use stringSet.len()/contains()/add() and new stop condition
  • Expands and parameterizes BenchmarkRing_Get_OneZoneLeaving (larger instances, more zones, and both with/without one zone leaving) with deterministic assertions
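
The switch from per-zone maps to slices indexed by zone index, described in the list above, can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the dskit implementation: `zoneIndex` and `countHostsPerZone` are hypothetical helpers, and returning an error for an unknown zone is an illustrative choice.

```go
package main

import "fmt"

// zoneIndex returns the position of zone in zones, or -1 if it is unknown.
// Hypothetical helper: with only a few zones a linear scan is cheap.
func zoneIndex(zones []string, zone string) int {
	for i, z := range zones {
		if z == zone {
			return i
		}
	}
	return -1
}

// countHostsPerZone returns a slice where counts[i] is the number of
// instances that belong to zones[i]. A slice indexed by zone index replaces
// a map[string]int keyed by zone name.
func countHostsPerZone(zones, instanceZones []string) ([]int, error) {
	counts := make([]int, len(zones))
	for _, z := range instanceZones {
		idx := zoneIndex(zones, z)
		if idx < 0 {
			return nil, fmt.Errorf("unknown zone %q", z)
		}
		counts[idx]++
	}
	return counts, nil
}

func main() {
	zones := []string{"zone-a", "zone-b", "zone-c"}
	counts, err := countHostsPerZone(zones, []string{"zone-a", "zone-c", "zone-a"})
	if err != nil {
		panic(err)
	}
	fmt.Println(counts) // prints [2 0 1]
}
```

Once the index is resolved, each per-zone counter update is a plain slice access with no hashing of the zone name.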

Net effect: substantial reductions in Ring.Get() time in pathological scenarios; no API changes.
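
For readers unfamiliar with the idea, a hybrid string set of the kind the summary mentions might look like the sketch below. Only the names `stringSet`, `newStringSet`, `setMap`, `len()`, `contains()`, and `add()` appear in this conversation; the `setSlice` field, the switch-over rule (move to a map once the caller-provided buffer is full), and every other detail are assumptions made for illustration, not the code merged in util.go.

```go
package main

import "fmt"

// stringSet is a sketch of a hybrid set: a slice while small, a map after.
type stringSet struct {
	setSlice []string            // hypothetical field: backing buffer for small sets
	setMap   map[string]struct{} // nil until the buffer capacity is exceeded
}

// newStringSet reuses buf as slice storage, so a small set needs no allocation.
func newStringSet(buf []string) *stringSet {
	return &stringSet{setSlice: buf[:0]}
}

func (s *stringSet) len() int {
	if s.setMap != nil {
		return len(s.setMap)
	}
	return len(s.setSlice)
}

func (s *stringSet) contains(v string) bool {
	if s.setMap != nil {
		_, ok := s.setMap[v]
		return ok
	}
	for _, item := range s.setSlice {
		if item == v {
			return true
		}
	}
	return false
}

func (s *stringSet) add(v string) {
	if s.contains(v) {
		return
	}
	if s.setMap != nil {
		s.setMap[v] = struct{}{}
		return
	}
	if len(s.setSlice) < cap(s.setSlice) {
		s.setSlice = append(s.setSlice, v)
		return
	}
	// Buffer is full (assumed switch-over rule): migrate all items to a map.
	s.setMap = make(map[string]struct{}, len(s.setSlice)+1)
	for _, item := range s.setSlice {
		s.setMap[item] = struct{}{}
	}
	s.setMap[v] = struct{}{}
}

func main() {
	s := newStringSet(make([]string, 0, 2))
	for _, v := range []string{"a", "b", "a", "c"} {
		s.add(v)
	}
	fmt.Println(s.len(), s.contains("c"), s.setMap != nil) // prints "3 true true"
}
```

With this rule, a zero-capacity buffer switches to the map on the first add, which matches the name of the test case discussed in the review.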

Written by Cursor Bugbot for commit 9f1c505.

@pracucci pracucci changed the title from "perf: Optimize Ring.Get() when a large number of instances are in a state for which the extended replication set is requested" to "perf: optimize Ring.Get() when a large number of instances are in a state for which the extended replication set is requested" Jan 26, 2026
@pracucci pracucci requested a review from 56quarters January 26, 2026 08:59
@pracucci pracucci assigned alexweav and unassigned alexweav Jan 26, 2026
@pracucci pracucci requested a review from alexweav January 26, 2026 09:46
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the fix-ring-performance-on-1-zone-leaving branch from 9f1c505 to 7d33cd7 Compare January 27, 2026 04:12
Signed-off-by: Marco Pracucci <marco@pracucci.com>

@56quarters 56quarters left a comment


LGTM, nicely done.

t.Run("zero capacity buffer switches to map immediately on first add", func(t *testing.T) {
	t.Parallel()
	buf := make([]string, 0)
	s := newStringSet(buf)

Maybe assert that setMap is nil before any adds?


Done in 2b36903


@alexweav alexweav left a comment


LGTM

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci enabled auto-merge (squash) January 28, 2026 14:45
@pracucci pracucci merged commit 18df891 into main Jan 28, 2026
10 checks passed
@pracucci pracucci deleted the fix-ring-performance-on-1-zone-leaving branch January 28, 2026 15:01

3 participants