Mimir components stopped working after upgrade of Helm Charts to version 5.2.0 #7240

abanfi-nozomi · 2024-01-29T07:50:48Z

Hi all,

For some time now we have been running a full LGTM stack on our Kubernetes (EKS) clusters, with all components deployed as Helm charts.
After upgrading Tempo from 1.8.1 to 1.8.2 and Mimir from 5.2.0 to 5.2.1, the Mimir deployment could no longer serve reads or writes.

The distributors were attempting to send metrics to pods that were unavailable, throwing errors such as `user=anonymous msg="push error" err="at least 2 live replicas are required, only 1 could be found--unhealthy instances: xxx:xxx: xxx:xxx:9095,xxx:xxx:xxx:9095"`, while all Grafana panels showed an error such as `expanding series: too many unhealthy instances in the ring (internal: rpc error: code = Code(500)`, coming from mimir-query-frontend.
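For readers unfamiliar with the first error, it is what a majority-quorum check produces when too few replicas are reachable. The sketch below is a simplified model, not Mimir's code: it assumes a replication factor of 3 (the values file does not override it) and the function names are made up for illustration.

```python
def min_live_replicas(replication_factor: int) -> int:
    """Majority quorum: more than half of the replicas must be reachable."""
    return replication_factor // 2 + 1


def can_write(replication_factor: int, live_replicas: int) -> bool:
    """Return True when enough replicas are live to accept a write."""
    return live_replicas >= min_live_replicas(replication_factor)


if __name__ == "__main__":
    # Assumed replication factor of 3 and a single live replica, as in the error above.
    print(min_live_replicas(3))
    print(can_write(3, 1))
```

Under these assumptions a write needs two live replicas, so a ring that still lists two unreachable instances for a series rejects the push even though one healthy ingester remains.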

The Mimir distributors were up and running, but the distributors and queriers seemed to be looking for endpoints that no longer existed.
We tried scaling all Mimir Deployments and StatefulSets down to zero replicas, to no avail.
We tried updating the memberlist at the distributor level, but still nothing.

The only way we found to get back to a fully functioning deployment was to uninstall the Mimir Helm chart and reinstall it from scratch.
Has anyone else run into the same problem, and is there a way to solve it without a fresh installation?

dimitarvdimitrov · 2024-02-06T15:33:58Z

Maintainer

Did you try rolling back to 5.2.0?

4 replies

abanfi-nozomi Feb 8, 2024
Author

Hi,

Yes, we rolled back to version 5.2.0, but the problem persisted.
To resolve it, we had to uninstall the Helm chart and reinstall it from scratch.

Our feeling (we cannot reproduce it, so I'm not 100% sure) is that the memberlist was not updated after the upgrade, and the new pods on the read and write paths were trying to reach instances that were no longer there.
Note that we have 5 clusters in 5 different regions. We performed the same upgrade via Terraform on all of them, and only 3 were affected by the problem.
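To make that hypothesis concrete, here is a toy model of a ring that still holds entries for pods that no longer exist. It is not Mimir's implementation: `RingEntry`, `split_by_health`, the addresses, and the 60-second timeout are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RingEntry:
    address: str           # hypothetical pod address
    last_heartbeat: float  # seconds, on the same clock as `now`


def split_by_health(entries, now, heartbeat_timeout):
    """Partition ring entries into (healthy, unhealthy) by heartbeat age."""
    healthy, unhealthy = [], []
    for entry in entries:
        age = now - entry.last_heartbeat
        (healthy if age <= heartbeat_timeout else unhealthy).append(entry)
    return healthy, unhealthy


if __name__ == "__main__":
    now = 1000.0
    ring = [
        RingEntry("old-pod-0:9095", last_heartbeat=100.0),  # pod is gone, entry remains
        RingEntry("old-pod-1:9095", last_heartbeat=120.0),  # pod is gone, entry remains
        RingEntry("new-pod-0:9095", last_heartbeat=995.0),  # still heartbeating
    ]
    healthy, unhealthy = split_by_health(ring, now, heartbeat_timeout=60.0)
    print([e.address for e in healthy])
    print([e.address for e in unhealthy])
```

In this model the stale entries stay in the ring and are only marked unhealthy, which would match errors that list unhealthy instances at addresses nobody is serving anymore.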

dimitarvdimitrov Feb 8, 2024
Maintainer

The change in 5.2.1 was minimal and it should not affect the ring or memberlist (it was a config change on the S3 bucket config).

What I suspect happened was uncoordinated shutdowns of the ingester pods. Do you have your values file handy for the clusters that saw this outage? If it's too large to paste here, then I think the mimir section would be most relevant.

abanfi-nozomi Feb 12, 2024
Author

Hi, here is the Mimir values file:

# Based on the `small.yaml` config
# https://raw.githubusercontent.com/grafana/mimir/main/operations/helm/charts/mimir-distributed/small.yaml

metaMonitoring:
  dashboards:
    enabled: true
    annotations:
      k8s-sidecar-target-directory: /tmp/dashboards/MetaMonitoring
      folder: MetaMonitoring
    labels:
      grafana_dashboard: "managed"
  serviceMonitor:
    enabled: true
  grafanaAgent:
    enabled: true
    installOperator: false
    logs:
      enabled: false
    scrapeK8s:
      # -- When grafanaAgent.enabled and serviceMonitor.enabled, controls whether to create ServiceMonitors CRs
      # for cadvisor, kubelet, and kube-state-metrics. The scraped metrics are reduced to those pertaining to
      # Mimir pods only.
      enabled: false
compactor:
  persistentVolume:
    enabled: true
    size: 20Gi
    storageClass: ${STORAGE_CLASS}
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 512Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
distributor:
  replicas: 2
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 512Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
ingester:
  persistentVolume:
    enabled: true
    size: 50Gi
    storageClass: ${STORAGE_CLASS}
  replicas: ${INGESTER_REPLICAS}
  resources:
    limits:
      memory: 6Gi
    requests:
      cpu: 100m
      memory: 512Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
  zoneAwareReplication:
    enabled: false
chunks-cache:
  enabled: true
  replicas: 2
  allocatedMemory: 2048
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: chunks-cache
              app.kubernetes.io/instance: mimir
          topologyKey: kubernetes.io/hostname
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
index-cache:
  enabled: true
  replicas: 2
  allocatedMemory: 1024
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
metadata-cache:
  enabled: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
results-cache:
  enabled: true
  replicas: 2
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
minio:
  enabled: false

overrides_exporter:
  replicas: 1
  resources:
    limits:
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
querier:
  replicas: 2
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 128Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
query_frontend:
  replicas: 1
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 128Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
store_gateway:
  persistentVolume:
    enabled: true
    size: 10Gi
    storageClass: ${STORAGE_CLASS}
  replicas: 3
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 512Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  zoneAwareReplication:
    enabled: false
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
nginx:
  service:
    annotations:
      "prometheus.io/probe": "true"
  nginxConfig:
    httpSnippet:
      "client_max_body_size 10m;"
  replicas: 2
  resources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
  affinity: |
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
# Grafana Enterprise Metrics feature related
admin_api:
  replicas: 1
  resources:
    limits:
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 64Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nws/workload.type
            operator: In
            values:
            - monitoring
  tolerations:
  - key: "nws/workload.type"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"

serviceAccount:
  create: false
  name: ${K8S_SERVICEACCOUNT_NAME}

mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.${BLOCK_BUCKET_REGION}.amazonaws.com
          region: ${BLOCK_BUCKET_REGION}
    blocks_storage:
      s3:
        bucket_name: ${BLOCK_BUCKET_NAME}
      storage_prefix: blocks
      bucket_store:
        chunks_cache:
          backend: memcached
          memcached:
            timeout: 4000ms
      tsdb:
        head_compaction_concurrency: 2
    alertmanager_storage:
      s3:
        bucket_name: ${BLOCK_BUCKET_NAME}-alertmanager
    ruler_storage:
      s3:
        bucket_name: ${BLOCK_BUCKET_NAME}-ruler
    limits:
      #Increase for Asset Mgnt SRE-692
      max_global_series_per_user: 4000000
      ingestion_rate: 100000
      ingestion_burst_size: 200000
      # Needed for Linkerd
      max_label_names_per_series: 40
      # Needed for Assets enrichment
      max_label_value_length: 20480
    frontend:
      results_cache:
        backend: memcached
        memcached:
          timeout: 10000ms
    compactor:
      compaction_concurrency: 2
      compaction_interval: 30m
rollout_operator:
  enabled: false

Regarding uncoordinated shutdowns of the ingester pods: should the gossip protocol rebuild the member list by discovering the new instances?

dimitarvdimitrov Feb 12, 2024
Maintainer

Nothing in there strikes me as the cause of this. Can you perhaps check how the rollout proceeded? It should be pod-by-pod according to the StatefulSet update strategy. You can run these two queries to see how many ingesters were restarted at a time around the time of the outage:

kube_statefulset_status_replicas_ready{statefulset="ingester"}

kube_statefulset_status_replicas{statefulset="ingester"}
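One way to read the results of those two queries, sketched in Python: pair the samples by timestamp and look at the largest gap between desired and ready replicas. The helper names and the sample numbers below are hypothetical, and the `statefulset` label matcher in the queries may need adjusting to match your StatefulSet's actual name.

```python
def max_unavailable(samples):
    """Return the largest (replicas - ready) gap seen across the samples.

    `samples` is an iterable of (replicas, ready) pairs taken at the same timestamps.
    """
    return max((replicas - ready for replicas, ready in samples), default=0)


def rolled_one_at_a_time(samples):
    """True when no more than one pod was unavailable at any sampled instant."""
    return max_unavailable(samples) <= 1


if __name__ == "__main__":
    # Hypothetical samples: (kube_statefulset_status_replicas, ..._replicas_ready)
    coordinated = [(3, 3), (3, 2), (3, 3), (3, 2), (3, 3)]
    uncoordinated = [(3, 3), (3, 1), (3, 0), (3, 3)]
    print(rolled_one_at_a_time(coordinated))
    print(rolled_one_at_a_time(uncoordinated))
```

A gap larger than one around the outage would point to several ingesters being down at once rather than a pod-by-pod rollout.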
