ruler: cap the number of remote eval retries (#10375)
* ruler: cap the number of remote eval retries
Retries happen more aggressively than actual evaluations. With the current setup, an error spike results in 3x the query rate: the initial query, plus two retries shortly afterwards (100ms and 200ms after that).
This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* Add CHANGELOG.md entry
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* Fix a totally arbitrary stupid linter rule
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* Use a CB instead of a rate limiter
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* Revert "Use a CB instead of a rate limiter"
This reverts commit b07366f.
* Don't abort retries if we're over the rate limit
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* Cancel reservation when context expires
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
---------
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
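The load figures quoted in the first commit message can be checked with a little arithmetic. The sketch below is illustrative only: `uncappedLoadFactor` and `worstCaseLoadFactor` are hypothetical helpers, not code from this PR, and the capped figure assumes every retry permitted by the cap is actually used.

```go
package main

import "fmt"

// uncappedLoadFactor models the old behaviour described in the commit
// message: each failing evaluation issues the initial query plus
// retriesPerEval retries. Hypothetical helper, for illustration only.
func uncappedLoadFactor(retriesPerEval int) float64 {
	return float64(1 + retriesPerEval)
}

// worstCaseLoadFactor returns the upper bound on total query rate,
// relative to the baseline evaluation rate, when retries are capped at
// retriesPerSec. Hypothetical helper, for illustration only.
func worstCaseLoadFactor(evalsPerSec, retriesPerSec float64) float64 {
	return (evalsPerSec + retriesPerSec) / evalsPerSec
}

func main() {
	// Two retries per failed evaluation, as in the current setup.
	fmt.Printf("uncapped: %.1fx\n", uncappedLoadFactor(2))
	// 340 evals/sec per ruler with retries capped at 170/sec.
	fmt.Printf("capped:   %.1fx\n", worstCaseLoadFactor(340, 170))
}
```

With the numbers from the commit message, the uncapped case comes out at 3.0x and the capped case at 1.5x, which matches the "3x" and "50%" figures quoted above.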
CHANGELOG.md (+2 -2)
@@ -2,12 +2,12 @@

 ## main / unreleased

-* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
-
 ### Grafana Mimir

 * [CHANGE] Distributor: OTLP and push handler replace all non-UTF8 characters with the unicode replacement character `\uFFFD` in error messages before propagating them. #10236
 * [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
+* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
+* [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375
 * [ENHANCEMENT] Distributor: OTLP receiver now converts also metric metadata. See also https://github.com/prometheus/prometheus/pull/15416. #10168
 * [ENHANCEMENT] Distributor: discard float and histogram samples with duplicated timestamps from each timeseries in a request before the request is forwarded to ingesters. Discarded samples are tracked by the `cortex_discarded_samples_total` metrics with reason `sample_duplicate_timestamp`. #10145
cmd/mimir/help.txt.tmpl (+2 -0)
@@ -737,6 +737,8 @@ Usage of ./cmd/mimir/mimir:
       Maximum number of rules per rule group by namespace. Value is a map, where each key is the namespace and value is the number of rules allowed in the namespace (int). On the command line, this map is given in a JSON format. The number of rules specified has the same meaning as -ruler.max-rules-per-rule-group, but only applies for the specific namespace. If specified, it supersedes -ruler.max-rules-per-rule-group. (default {})
   -ruler.query-frontend.address string
       GRPC listen address of the query-frontend(s). Must be a DNS address (prefixed with dns:///) to enable client side load balancing.
+  -ruler.query-frontend.max-retries-rate float
+      Maximum number of retries for failed queries per second. (default 170)
pkg/ruler/remotequerier.go (+22 -5)
@@ -31,6 +31,7 @@ import (
 	"github.com/prometheus/prometheus/promql"
 	"github.com/prometheus/prometheus/storage"
 	"github.com/prometheus/prometheus/storage/remote"
+	"golang.org/x/time/rate"
 	"google.golang.org/grpc"
 	"google.golang.org/grpc/codes"
@@ -67,6 +68,8 @@ type QueryFrontendConfig struct {
 	GRPCClientConfig grpcclient.Config `yaml:"grpc_client_config" doc:"description=Configures the gRPC client used to communicate between the rulers and query-frontends."`
 	f.StringVar(&c.QueryResultResponseFormat, "ruler.query-frontend.query-result-response-format", formatProtobuf, fmt.Sprintf("Format to use when retrieving query results from query-frontends. Supported values: %s", strings.Join(allFormats, ", ")))
+	f.Float64Var(&c.MaxRetriesRate, "ruler.query-frontend.max-retries-rate", 170, "Maximum number of retries for failed queries per second.")