update Mimir / Reads and Remote Ruler Reads dashboard with query scheduler metrics (#10290)
francoposa authored Jan 3, 2025
1 parent 27a7389 commit f5a03c1
Showing 8 changed files with 716 additions and 234 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@
* [ENHANCEMENT] Ruler: Add `cortex_prometheus_rule_group_last_rule_duration_sum_seconds` metric to track the total evaluation duration of a rule group regardless of concurrency #10189
* [ENHANCEMENT] Distributor: Add native histogram support for `electedReplicaPropagationTime` metric in ha_tracker. #10264
* [ENHANCEMENT] Ingester: More efficient CPU/memory utilization-based read request limiting. #10325
* [ENHANCEMENT] Dashboards: Add Query-Scheduler <-> Querier Inflight Requests row to Query Reads and Remote Ruler reads dashboards. #10290
* [BUGFIX] Distributor: Use a boolean to track changes while merging the ReplicaDesc components, rather than comparing the objects directly. #10185
* [BUGFIX] Querier: fix timeout responding to query-frontend when response size is very close to `-querier.frontend-client.grpc-max-send-msg-size`. #10154
* [BUGFIX] Query-frontend and querier: show warning/info annotations in some cases where they were missing (if a lazy querier was used). #10277
56 changes: 48 additions & 8 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -775,17 +775,48 @@ The procedure to investigate it is the same as the one for [`MimirSchedulerQueri
This alert fires if queries are piling up in the query-scheduler.
#### Dashboard Panels

The size of the queue is shown on the `Queue Length` dashboard panel on the `Mimir / Reads` (for the standard query path)
or `Mimir / Remote Ruler Reads` (for the dedicated rule evaluation query path) dashboards.

The `Latency (Time in Queue)` is broken out in the dashboard row below by the "Expected Query Component":
the scheduler queue itself is partitioned by the Expected Query Component for each query,
which is an estimate, based on the query time range, of which component the querier will utilize to fetch the data.

The row below shows peak values for `Query-scheduler <-> Querier Inflight Requests`, also broken out by query component.
This shows when the queriers are saturated with inflight query requests,
as well as which query components are utilized to service the queries.
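
If the dashboards are unavailable, the same signals can be queried directly. A minimal sketch, assuming the standard Mimir metric names `cortex_query_scheduler_queue_length` (also used later in this runbook) and `cortex_query_scheduler_queue_duration_seconds`; the label that carries the per-query-component breakout varies by Mimir version, so it is omitted here.

```promql
# Total number of queries currently sitting in each query-scheduler's queue.
# Grouping by pod assumes standard Kubernetes scrape labels.
sum by (pod) (cortex_query_scheduler_queue_length{namespace="<namespace>"})

# Approximate p99 time spent in the scheduler queue over the last 5 minutes.
histogram_quantile(
  0.99,
  sum by (le) (rate(cortex_query_scheduler_queue_duration_seconds_bucket{namespace="<namespace>"}[5m]))
)
```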
#### How it Works
- A query-frontend API endpoint is called to execute a query.
- The query-frontend enqueues the request to the query-scheduler.
- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers.
- The querier fetches data from ingesters, store-gateways, or both, and runs the query against the data.
  Then, it sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query.
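
One way to sanity-check this flow outside the dashboards is to confirm that querier workers are actually connected to the query-scheduler. A hedged sketch, assuming the `cortex_query_scheduler_connected_querier_clients` gauge exists in your Mimir version; each querier maintains roughly `-querier.max-concurrent` worker connections split across schedulers, so the value is normally a multiple of the querier replica count.

```promql
# Querier worker connections currently registered with each query-scheduler replica.
# A value near zero while the queue grows suggests queriers cannot reach the scheduler.
sum by (pod) (cortex_query_scheduler_connected_querier_clients{namespace="<namespace>"})
```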
#### How to Investigate
Note that elevated measures of _inflight_ queries at any part of the read path are likely a symptom and not a cause.
**Ingester or Store-Gateway Issues**
With querier autoscaling in place, the most common cause of a query backlog is that either ingesters or store-gateways
are not able to keep up with their query load.
Investigate the RPS and Latency panels for ingesters and store-gateways on the `Mimir / Reads` dashboard
and compare them to the `Latency (Time in Queue)` and `Query-scheduler <-> Querier Inflight Requests`
breakouts on the `Mimir / Reads` or `Mimir / Remote Ruler Reads` dashboards.
Additionally, check the `Mimir / Reads Resources` dashboard for elevated resource utilization or limiting on ingesters or store-gateways.
Generally, this shows whether the ingesters or the store-gateways are experiencing issues,
and you can then investigate that component on its own.
Scaling up queriers is unlikely to help in this case, as it places more load on an already-overloaded component.
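
To compare the two backends directly, one option is to look at per-route read latency. A rough sketch, assuming the standard `cortex_request_duration_seconds` histogram; the exact `route` label values can differ between Mimir versions, so verify them against your metrics.

```promql
# p99 latency of ingester read requests.
histogram_quantile(0.99, sum by (le) (
  rate(cortex_request_duration_seconds_bucket{namespace="<namespace>", route="/cortex.Ingester/QueryStream"}[5m])
))

# p99 latency of store-gateway series requests.
histogram_quantile(0.99, sum by (le) (
  rate(cortex_request_duration_seconds_bucket{namespace="<namespace>", route="/gatewaypb.StoreGateway/Series"}[5m])
))
```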
**Querier Issues**
- Are queriers in a crash loop (e.g. OOMKilled)?
  - `OOMKilled`: temporarily increase queriers memory request/limit
@@ -799,7 +830,7 @@ How to **investigate**:
- Check if a specific tenant is running heavy queries
  - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"}) > 0` to find tenants with enqueued queries
- If remote ruler evaluation is enabled, make sure you understand which of the read paths (user queries or ruler queries) is affected; check the alert message.
- Check the `Mimir / Slow Queries` dashboard to find slow queries
- On multi-tenant Mimir clusters with **shuffle-sharding for queriers disabled**, you may consider enabling it for that specific tenant to reduce its blast radius. To enable queriers' shuffle-sharding for a single tenant, set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
- On multi-tenant Mimir clusters with **shuffle-sharding for queriers enabled**, you may consider temporarily increasing the shard size for affected tenants: be aware that this could affect other tenants too, reducing resources available to run other tenants' queries. Alternatively, you may choose to do nothing and let Mimir return errors for that given user once the per-tenant queue is full.
- On multi-tenant Mimir clusters with **query-sharding enabled** and **more than a few tenants** being affected: The workload exceeds the available downstream capacity. Scaling of queriers and potentially store-gateways should be considered.
@@ -808,6 +839,15 @@ How to **investigate**:
- Otherwise, and only if the tenant's queries are within reason and represent normal usage, consider scaling queriers and potentially store-gateways.
- On a Mimir cluster with **querier auto-scaling enabled**, after checking the health of the existing querier replicas, check whether the auto-scaler has added additional querier replicas, or whether the maximum number of querier replicas has been reached, is not sufficient, and should be increased (see the example query below).
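
For the autoscaling check in the last item, if kube-state-metrics is scraped in the cluster you can compare the querier HPA's current replicas against its configured maximum. A sketch, assuming the HPA object is named `keda-hpa-querier` (the actual name depends on how querier autoscaling was deployed).

```promql
# Ratio of current to maximum querier replicas; a value of 1 means the autoscaler is capped out.
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>", horizontalpodautoscaler="keda-hpa-querier"}
/
kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>", horizontalpodautoscaler="keda-hpa-querier"}
```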
**Query-Scheduler Issues**
In rare cases, the query-scheduler itself may be the bottleneck.
When querier-connection utilization is low in the `Query-scheduler <-> Querier Inflight Requests` dashboard panels
but the queue length or latency is high, it indicates that the query-scheduler is very slow in dispatching queries.
In this case, if the scheduler is not resource-constrained,
you can use CPU profiles to see where the scheduler's query dispatch process is spending its time.
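
To confirm the scheduler is not resource-constrained, one quick check is CPU throttling on the query-scheduler pods. A sketch using the cAdvisor throttling counters, assuming those metrics are scraped in your environment and the container is named `query-scheduler`.

```promql
# Fraction of CPU periods in which the query-scheduler container was throttled.
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="<namespace>", container="query-scheduler"}[5m]))
/
sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="<namespace>", container="query-scheduler"}[5m]))
```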
### MimirCacheRequestErrors
This alert fires if the Mimir cache client is experiencing a high error rate for a specific cache and operation.
