Commit ffee57d

ruler: cap the number of remote eval retries (#10375)
* ruler: cap the number of remote eval retries

  The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

  This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

  Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Add CHANGELOG.md entry

  Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Fix a totally arbitrary stupid linter rule

  Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Use a CB instead of a rate limtier

  Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Revert "Use a CB instead of a rate limtier"

  This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

  Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Cancel reservation when context expires

  Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

---------

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
1 parent 3091c3a commit ffee57d
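The mechanism behind the numbers above is a single token-bucket limiter from `golang.org/x/time/rate` (the package the diff below imports), shared by all rule evaluations in the ruler process. The following is a minimal, self-contained sketch of the spreading effect only, not Mimir code; the 5/sec rate and the variable names are invented for illustration:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Sketch: one limiter shared by every retry in the process. The PR
	// defaults to 170 retries/sec; 5/sec makes the spacing easy to see.
	retryLimiter := rate.NewLimiter(rate.Limit(5), 1)

	// Simulate ten evaluations failing at the same instant. Each retry must
	// reserve a token, so instead of all retries firing within a couple
	// hundred milliseconds they are spread out at the configured rate.
	for i := 0; i < 10; i++ {
		res := retryLimiter.Reserve() // never blocks; reports how long to wait
		fmt.Printf("retry %d would wait %v\n", i, res.Delay().Round(time.Millisecond))
	}
}

With a burst of 1, successive reservations come back roughly 200ms apart at 5/sec; at the default 170/sec the spacing is about 6ms, which bounds how much extra query load an error spike can generate.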

File tree: 8 files changed, +58 -23 lines changed

CHANGELOG.md

Lines changed: 2 additions & 2 deletions
@@ -2,12 +2,12 @@
 
 ## main / unreleased
 
-* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
-
 ### Grafana Mimir
 
 * [CHANGE] Distributor: OTLP and push handler replace all non-UTF8 characters with the unicode replacement character `\uFFFD` in error messages before propagating them. #10236
 * [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
+* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
+* [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375
 * [ENHANCEMENT] Query Frontend: Return server-side `samples_processed` statistics. #10103
 * [ENHANCEMENT] Distributor: OTLP receiver now converts also metric metadata. See also https://github.com/prometheus/prometheus/pull/15416. #10168
 * [ENHANCEMENT] Distributor: discard float and histogram samples with duplicated timestamps from each timeseries in a request before the request is forwarded to ingesters. Discarded samples are tracked by the `cortex_discarded_samples_total` metrics with reason `sample_duplicate_timestamp`. #10145

cmd/mimir/config-descriptor.json

Lines changed: 10 additions & 0 deletions
@@ -13227,6 +13227,16 @@
         "fieldDefaultValue": "protobuf",
         "fieldFlag": "ruler.query-frontend.query-result-response-format",
         "fieldType": "string"
+      },
+      {
+        "kind": "field",
+        "name": "max_retries_rate",
+        "required": false,
+        "desc": "Maximum number of retries for failed queries per second.",
+        "fieldValue": null,
+        "fieldDefaultValue": 170,
+        "fieldFlag": "ruler.query-frontend.max-retries-rate",
+        "fieldType": "float"
       }
     ],
     "fieldValue": null,

cmd/mimir/help-all.txt.tmpl

Lines changed: 2 additions & 0 deletions
@@ -2987,6 +2987,8 @@ Usage of ./cmd/mimir/mimir:
     	Override the default minimum TLS version. Allowed values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13
   -ruler.query-frontend.grpc-client-config.tls-server-name string
     	Override the expected name on the server certificate.
+  -ruler.query-frontend.max-retries-rate float
+    	Maximum number of retries for failed queries per second. (default 170)
   -ruler.query-frontend.query-result-response-format string
     	Format to use when retrieving query results from query-frontends. Supported values: json, protobuf (default "protobuf")
   -ruler.query-stats-enabled

cmd/mimir/help.txt.tmpl

Lines changed: 2 additions & 0 deletions
@@ -737,6 +737,8 @@ Usage of ./cmd/mimir/mimir:
     	Maximum number of rules per rule group by namespace. Value is a map, where each key is the namespace and value is the number of rules allowed in the namespace (int). On the command line, this map is given in a JSON format. The number of rules specified has the same meaning as -ruler.max-rules-per-rule-group, but only applies for the specific namespace. If specified, it supersedes -ruler.max-rules-per-rule-group. (default {})
   -ruler.query-frontend.address string
     	GRPC listen address of the query-frontend(s). Must be a DNS address (prefixed with dns:///) to enable client side load balancing.
+  -ruler.query-frontend.max-retries-rate float
+    	Maximum number of retries for failed queries per second. (default 170)
   -ruler.query-frontend.query-result-response-format string
     	Format to use when retrieving query results from query-frontends. Supported values: json, protobuf (default "protobuf")
   -ruler.recording-rules-evaluation-enabled

docs/sources/mimir/configure/configuration-parameters/index.md

Lines changed: 4 additions & 0 deletions
@@ -2109,6 +2109,10 @@ query_frontend:
   # CLI flag: -ruler.query-frontend.query-result-response-format
   [query_result_response_format: <string> | default = "protobuf"]
 
+  # Maximum number of retries for failed queries per second.
+  # CLI flag: -ruler.query-frontend.max-retries-rate
+  [max_retries_rate: <float> | default = 170]
+
 tenant_federation:
   # Enable rule groups to query against multiple tenants. The tenant IDs
   # involved need to be in the rule group's 'source_tenants' field. If this flag

pkg/mimir/modules.go

Lines changed: 1 addition & 1 deletion
@@ -859,7 +859,7 @@ func (t *Mimir) initRuler() (serv services.Service, err error) {
 		if err != nil {
 			return nil, err
 		}
-		remoteQuerier := ruler.NewRemoteQuerier(queryFrontendClient, t.Cfg.Querier.EngineConfig.Timeout, t.Cfg.Ruler.QueryFrontend.QueryResultResponseFormat, t.Cfg.API.PrometheusHTTPPrefix, util_log.Logger, ruler.WithOrgIDMiddleware)
+		remoteQuerier := ruler.NewRemoteQuerier(queryFrontendClient, t.Cfg.Querier.EngineConfig.Timeout, t.Cfg.Ruler.QueryFrontend.MaxRetriesRate, t.Cfg.Ruler.QueryFrontend.QueryResultResponseFormat, t.Cfg.API.PrometheusHTTPPrefix, util_log.Logger, ruler.WithOrgIDMiddleware)
 
 		embeddedQueryable = prom_remote.NewSampleAndChunkQueryableClient(
 			remoteQuerier,

pkg/ruler/remotequerier.go

Lines changed: 22 additions & 5 deletions
@@ -31,6 +31,7 @@ import (
 	"github.com/prometheus/prometheus/promql"
 	"github.com/prometheus/prometheus/storage"
 	"github.com/prometheus/prometheus/storage/remote"
+	"golang.org/x/time/rate"
 	"google.golang.org/grpc"
 	"google.golang.org/grpc/codes"
 
@@ -67,6 +68,8 @@ type QueryFrontendConfig struct {
 	GRPCClientConfig grpcclient.Config `yaml:"grpc_client_config" doc:"description=Configures the gRPC client used to communicate between the rulers and query-frontends."`
 
 	QueryResultResponseFormat string `yaml:"query_result_response_format"`
+
+	MaxRetriesRate float64 `yaml:"max_retries_rate"`
 }
 
 func (c *QueryFrontendConfig) RegisterFlags(f *flag.FlagSet) {
@@ -80,6 +83,7 @@ func (c *QueryFrontendConfig) RegisterFlags(f *flag.FlagSet) {
 	c.GRPCClientConfig.RegisterFlagsWithPrefix("ruler.query-frontend.grpc-client-config", f)
 
 	f.StringVar(&c.QueryResultResponseFormat, "ruler.query-frontend.query-result-response-format", formatProtobuf, fmt.Sprintf("Format to use when retrieving query results from query-frontends. Supported values: %s", strings.Join(allFormats, ", ")))
+	f.Float64Var(&c.MaxRetriesRate, "ruler.query-frontend.max-retries-rate", 170, "Maximum number of retries for failed queries per second.")
 }
 
 func (c *QueryFrontendConfig) Validate() error {
@@ -115,6 +119,7 @@ type Middleware func(ctx context.Context, req *httpgrpc.HTTPRequest) error
 // RemoteQuerier executes read operations against a httpgrpc.HTTPClient.
 type RemoteQuerier struct {
 	client         httpgrpc.HTTPClient
+	retryLimiter   *rate.Limiter
 	timeout        time.Duration
 	middlewares    []Middleware
 	promHTTPPrefix string
@@ -130,6 +135,7 @@ var protobufDecoderInstance = protobufDecoder{}
 func NewRemoteQuerier(
 	client httpgrpc.HTTPClient,
 	timeout time.Duration,
+	maxRetryRate float64, // maxRetryRate is the maximum number of retries for failed queries per second.
 	preferredQueryResultResponseFormat string,
 	prometheusHTTPPrefix string,
 	logger log.Logger,
@@ -138,6 +144,7 @@ func NewRemoteQuerier(
 	return &RemoteQuerier{
 		client:         client,
 		timeout:        timeout,
+		retryLimiter:   rate.NewLimiter(rate.Limit(maxRetryRate), 1),
 		middlewares:    middlewares,
 		promHTTPPrefix: prometheusHTTPPrefix,
 		logger:         logger,
@@ -349,12 +356,22 @@ func (q *RemoteQuerier) sendRequest(ctx context.Context, req *httpgrpc.HTTPReque
 		if !retry.Ongoing() {
 			return nil, err
 		}
-		level.Warn(logger).Log("msg", "failed to remotely evaluate query expression, will retry", "err", err)
-		retry.Wait()
 
-		// Avoid masking last known error if context was cancelled while waiting.
-		if ctx.Err() != nil {
-			return nil, fmt.Errorf("%s while retrying request, last err was: %w", ctx.Err(), err)
+		retryReservation := q.retryLimiter.Reserve()
+		if !retryReservation.OK() {
+			// This should only happen if we've misconfigured the limiter.
+			return nil, fmt.Errorf("couldn't reserve a retry token")
+		}
+		// We want to wait at least the time for the backoff, but also don't want to exceed the rate limit.
+		// All of this is capped to the max backoff, so that we are less likely to overrun into the next evaluation.
+		retryDelay := max(retry.NextDelay(), min(retryConfig.MaxBackoff, retryReservation.Delay()))
+		level.Warn(logger).Log("msg", "failed to remotely evaluate query expression, will retry", "err", err, "retry_delay", retryDelay)
+		select {
+		case <-time.After(retryDelay):
+		case <-ctx.Done():
+			retryReservation.Cancel()
+			// Avoid masking last known error if context was cancelled while waiting.
+			return nil, fmt.Errorf("%s while retrying request, last error was: %w", ctx.Err(), err)
 		}
 	}
 }

pkg/ruler/remotequerier_test.go

Lines changed: 15 additions & 15 deletions
@@ -64,7 +64,7 @@ func TestRemoteQuerier_Read(t *testing.T) {
 	t.Run("should issue a remote read request", func(t *testing.T) {
 		client, inReq := setup()
 
-		q := NewRemoteQuerier(client, time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+		q := NewRemoteQuerier(client, time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 		_, err := q.Read(context.Background(), &prompb.Query{}, false)
 		require.NoError(t, err)
 
@@ -76,7 +76,7 @@
 	t.Run("should not inject the read consistency header if none is defined in the context", func(t *testing.T) {
 		client, inReq := setup()
 
-		q := NewRemoteQuerier(client, time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+		q := NewRemoteQuerier(client, time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 		_, err := q.Read(context.Background(), &prompb.Query{}, false)
 		require.NoError(t, err)
 
@@ -86,7 +86,7 @@
 	t.Run("should inject the read consistency header if it is defined in the context", func(t *testing.T) {
 		client, inReq := setup()
 
-		q := NewRemoteQuerier(client, time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+		q := NewRemoteQuerier(client, time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 
 		ctx := api.ContextWithReadConsistencyLevel(context.Background(), api.ReadConsistencyStrong)
 		_, err := q.Read(ctx, &prompb.Query{}, false)
@@ -101,7 +101,7 @@ func TestRemoteQuerier_ReadReqTimeout(t *testing.T) {
 		<-ctx.Done()
 		return nil, ctx.Err()
 	}
-	q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Second, formatJSON, "/prometheus", log.NewNopLogger())
+	q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Second, 1, formatJSON, "/prometheus", log.NewNopLogger())
 
 	_, err := q.Read(context.Background(), &prompb.Query{}, false)
 	require.Error(t, err)
@@ -139,7 +139,7 @@ func TestRemoteQuerier_Query(t *testing.T) {
 		t.Run(fmt.Sprintf("format = %s", format), func(t *testing.T) {
 			client, inReq := setup()
 
-			q := NewRemoteQuerier(client, time.Minute, format, "/prometheus", log.NewNopLogger())
+			q := NewRemoteQuerier(client, time.Minute, 1, format, "/prometheus", log.NewNopLogger())
 			_, err := q.Query(context.Background(), "qs", tm)
 			require.NoError(t, err)
 
@@ -165,7 +165,7 @@
 	t.Run("should not inject the read consistency header if none is defined in the context", func(t *testing.T) {
 		client, inReq := setup()
 
-		q := NewRemoteQuerier(client, time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+		q := NewRemoteQuerier(client, time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 		_, err := q.Query(context.Background(), "qs", tm)
 		require.NoError(t, err)
 
@@ -175,7 +175,7 @@
 	t.Run("should inject the read consistency header if it is defined in the context", func(t *testing.T) {
 		client, inReq := setup()
 
-		q := NewRemoteQuerier(client, time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+		q := NewRemoteQuerier(client, time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 
 		ctx := api.ContextWithReadConsistencyLevel(context.Background(), api.ReadConsistencyStrong)
 		_, err := q.Query(ctx, "qs", tm)
@@ -276,20 +276,20 @@ func TestRemoteQuerier_QueryRetryOnFailure(t *testing.T) {
 				}
 				return testCase.response, nil
 			}
-			q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+			q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 			require.Equal(t, int64(0), count.Load())
 			_, err := q.Query(ctx, "qs", time.Now())
 			if testCase.err == nil {
 				if testCase.expectedError == nil {
 					require.NoError(t, err)
 				} else {
 					require.Error(t, err)
-					require.EqualError(t, err, testCase.expectedError.Error())
+					require.ErrorContains(t, err, testCase.expectedError.Error())
 				}
 				require.Equal(t, int64(1), count.Load())
 			} else {
 				require.Error(t, err)
-				require.EqualError(t, err, testCase.expectedError.Error())
+				require.ErrorContains(t, err, testCase.expectedError.Error())
 				if testCase.expectedRetries {
 					require.Greater(t, count.Load(), int64(1))
 				} else {
@@ -405,7 +405,7 @@ func TestRemoteQuerier_QueryJSONDecoding(t *testing.T) {
 					Body: []byte(scenario.body),
 				}, nil
 			}
-			q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+			q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 
 			tm := time.Unix(1649092025, 515834)
 			actual, err := q.Query(context.Background(), "qs", tm)
@@ -678,7 +678,7 @@ func TestRemoteQuerier_QueryProtobufDecoding(t *testing.T) {
 					Body: b,
 				}, nil
 			}
-			q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatProtobuf, "/prometheus", log.NewNopLogger())
+			q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, 1, formatProtobuf, "/prometheus", log.NewNopLogger())
 
 			tm := time.Unix(1649092025, 515834)
 			actual, err := q.Query(context.Background(), "qs", tm)
@@ -701,7 +701,7 @@ func TestRemoteQuerier_QueryUnknownResponseContentType(t *testing.T) {
 			Body: []byte("some body content"),
 		}, nil
 	}
-	q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatJSON, "/prometheus", log.NewNopLogger())
+	q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, 1, formatJSON, "/prometheus", log.NewNopLogger())
 
 	tm := time.Unix(1649092025, 515834)
 	_, err := q.Query(context.Background(), "qs", tm)
@@ -713,7 +713,7 @@ func TestRemoteQuerier_QueryReqTimeout(t *testing.T) {
 		<-ctx.Done()
 		return nil, ctx.Err()
 	}
-	q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Second, formatJSON, "/prometheus", log.NewNopLogger())
+	q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Second, 1, formatJSON, "/prometheus", log.NewNopLogger())
 
 	tm := time.Unix(1649092025, 515834)
 	_, err := q.Query(context.Background(), "qs", tm)
@@ -771,7 +771,7 @@ func TestRemoteQuerier_StatusErrorResponses(t *testing.T) {
 			return testCase.resp, testCase.err
 		}
 		logger := newLoggerWithCounter()
-		q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, formatJSON, "/prometheus", logger)
+		q := NewRemoteQuerier(mockHTTPGRPCClient(mockClientFn), time.Minute, 1, formatJSON, "/prometheus", logger)
 
 		tm := time.Unix(1649092025, 515834)
 
