[BUG] Cortex v1.18.0 Upgrade Causing OOMKills and CPU Spikes in Store-Gateway #6259

dpericaxon · 2024-10-10T18:23:43Z

Describe the bug
Following the upgrade of Cortex from v1.17.1 to v1.18.0, the Store Gateway Pods are frequently encountering OOMKills. These events appear random, occurring approximately every 5 minutes, and have continued beyond the upgrade. Before the upgrade, memory usage consistently hovered around 4GB, with CPU usage under 1 core. However, after the upgrade, both CPU and memory usage have spiked to over 10 times their typical levels. Even after increasing the memory limit for the Store Gateway to 30GB, the issue persists. (see graph below)

We initially suspected the issue might be related to the sharding ring configurations, so we attempted to disable the following flags:

store-gateway.sharding-ring.zone-awareness-enabled=False
store-gateway.sharding-ring.zone-stable-shuffle-sharding=False

However, this did not resolve the problem.

CPU Graph: The far left shows usage before the upgrade, the middle represents usage during the upgrade, and the far right illustrates the rollback, where CPU usage returns to normal levels.

Memory Graph: The far left shows memory usage before the upgrade, the middle represents usage during the upgrade, and the far right reflects the rollback, where memory usage returns to normal levels.

To Reproduce
Steps to reproduce the behavior:

Upgrade to Cortex v1.18.0 from v1.17.1 using the Cortex Helm Chart with the values in the Additional Context section.

Expected behavior
Store Gateway Pods should not be OOMKilled after the upgrade.

Environment:

Infrastructure: AKS (Kubernetes)
Deployment tool: Cortex Helm Chart v2.3.0 or v2.4.0

Additional Context

Helm Chart Values Passed

        useExternalConfig: true
        image:
          repository: redact
          tag: v1.18.0
        externalConfigVersion: x
        ingress:
          enabled: true
          ingressClass:
            enabled: true
            name: nginx
          hosts:
            - host: cortex.redact
              paths:
                - /
          tls:
            - hosts:
              - cortex.redact
        serviceAccount:
          create: true
          automountServiceAccountToken: true
        store_gateway:
          replicas: 6
          persistentVolume:
            storageClass: premium
            size: 64Gi
          resources:
            resources:
              limits:
                memory: 24Gi
              requests:
                memory: 18Gi
          extraArgs:
            blocks-storage.bucket-store.index-cache.memcached.max-async-buffer-size: "10000000"
            blocks-storage.bucket-store.index-cache.memcached.max-get-multi-concurrency: "100"
            blocks-storage.bucket-store.index-cache.memcached.max-get-multi-batch-size: "100"
            blocks-storage.bucket-store.bucket-index.enabled: true
            blocks-storage.bucket-store.index-header-lazy-loading-enabled: true
            store-gateway.sharding-ring.zone-stable-shuffle-sharding: False
            store-gateway.sharding-ring.zone-awareness-enabled: False
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
        compactor:
          persistentVolume:
            size: 256Gi
            storageClass: premium
          resources:
            limits:
              cpu: 4
              memory: 10Gi
            requests:
              cpu: 1.5
              memory: 5Gi
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
          extraArgs:
            blocks-storage.bucket-store.bucket-index.enabled: true
        nginx:
          replicas: 3
          image:
            repository: redact
            tag: 1.27.2-alpine-slim
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
          resources:
            limits:
              cpu: 500m
              memory: 500Mi
            requests:
              cpu: 500m
              memory: 500Mi
          config:
            verboseLogging: false
        query_frontend:
          replicas: 3
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 200m
              memory: 4Gi
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
          extraArgs:
            querier.query-ingesters-within: 8h
        querier:
          replicas: 3
          resources:
            limits:
              cpu: 8
              memory: 26Gi
            requests:
              cpu: 1
              memory: 20Gi
          extraArgs:
            querier.query-ingesters-within: 8h
            querier.max-fetched-data-bytes-per-query: "2147483648"
            querier.max-fetched-chunks-per-query: "1000000"
            querier.max-fetched-series-per-query: "200000"
            querier.max-samples: "50000000"
            blocks-storage.bucket-store.bucket-index.enabled: true
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
        ingester:
          statefulSet:
            enabled: true
          replicas: 18
          persistentVolume:
            enabled: true
            size: 64Gi
            storageClass: premium
          resources:
            limits:
              cpu: 8
              memory: 45Gi
            requests:
              cpu: 8
              memory: 40Gi
          extraArgs:
            ingester.max-metadata-per-user: "50000"
            ingester.max-series-per-metric: "200000"
            ingester.instance-limits.max-series: "0"
            ingester.ignore-series-limit-for-metric-names: "redact"
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
        ruler:
          validation:
            enabled: false
          replicas: 3
          resources:
            limits:
              cpu: 2
              memory: 6Gi
            requests:
              cpu: 500m
              memory: 3Gi
          sidecar:
            image:
              repository: redact
              tag: 1.28.0
            resources:
              limits:
                cpu: 1
                memory: 200Mi
              requests:
                cpu: 50m
                memory: 100Mi
            enabled: true
            searchNamespace: cortex-rules
            folder: /tmp/rules
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
          extraArgs:
            blocks-storage.bucket-store.bucket-index.enabled: true
            querier.max-fetched-chunks-per-query: "2000000"
        alertmanager:
          enabled: true
          replicas: 3
          podAnnotations:
            configmap.reloader.stakater.com/reload: "redact"
          statefulSet:
            enabled: true
          persistentVolume:
            size: 8Gi
            storageClass: premium
          sidecar:
            image:
              repository: redact
              tag: 1.28.0
            containerSecurityContext:
              enabled: true
              runAsUser: 0
            resources:
              limits:
                cpu: 100m
                memory: 200Mi
              requests:
                cpu: 50m
                memory: 100Mi
            enabled: true
            searchNamespace: cortex-alertmanager
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
        distributor:
          resources:
            limits:
              cpu: 4
              memory: 10Gi
            requests:
              cpu: 2
              memory: 10Gi
          extraArgs:
            distributor.ingestion-rate-limit: "120000"
            validation.max-label-names-per-series: 40
            distributor.ha-tracker.enable-for-all-users: true
            distributor.ha-tracker.enable: true
            distributor.ha-tracker.failover-timeout: 30s
            distributor.ha-tracker.cluster: "prometheus"
            distributor.ha-tracker.replica: "prometheus_replica"
            distributor.ha-tracker.consul.hostname: consul.cortex:8500
            distributor.instance-limits.max-ingestion-rate: "120000"
          serviceMonitor:
            enabled: true
            additionalLabels:
              release: kube-prometheus-stack
            relabelings:
              - sourceLabels: [__meta_kubernetes_pod_name]
                targetLabel: instance
          autoscaling:
            minReplicas: 15
            maxReplicas: 30
        memcached-frontend:
          enabled: true
          image:
            registry: redact
            repository: redact/memcached-bitnami
            tag: redact
          commonLabels:
            release: kube-prometheus-stack
          podManagementPolicy: OrderedReady
          metrics:
            enabled: true
            image:
              registry: redact
              repository: redact/memcached-exporter-bitnami
              tag: redact
            serviceMonitor:
              enabled: true
              relabelings:
                - sourceLabels: [__meta_kubernetes_pod_name]
                  targetLabel: instance
          resources:
            requests:
              memory: 1Gi
              cpu: 1
            limits:
              memory: 1.5Gi
              cpu: 1
          args:
            - /run.sh
            - -I 32m
          serviceAccount:
            create: true
        memcached-blocks-index:
          enabled: true
          image:
            registry: redact
            repository: redact/memcached-bitnami
            tag: redact
          commonLabels:
            release: kube-prometheus-stack
          podManagementPolicy: OrderedReady
          metrics:
            enabled: true
            image:
              registry: redact
              repository: redact/memcached-exporter-bitnami
              tag: redact
            serviceMonitor:
              enabled: true
              relabelings:
                - sourceLabels: [__meta_kubernetes_pod_name]
                  targetLabel: instance
          resources:
            requests:
              memory: 1Gi
              cpu: 1
            limits:
              memory: 1.5Gi
              cpu: 1.5
          args:
            - /run.sh
            - -I 32m
          serviceAccount:
            create: true
        memcached-blocks:
          enabled: true
          image:
            registry: redact
            repository: redact/memcached-bitnami
            tag: redact
          commonLabels:
            release: kube-prometheus-stack
          podManagementPolicy: OrderedReady
          metrics:
            enabled: true
            image:
              registry: redact
              repository: redact/memcached-exporter-bitnami
              tag: redact
            serviceMonitor:
              enabled: true
              relabelings:
                - sourceLabels: [__meta_kubernetes_pod_name]
                  targetLabel: instance
          resources:
            requests:
              memory: 2Gi
              cpu: 1
            limits:
              memory: 3Gi
              cpu: 1
          args:
            - /run.sh
            - -I 32m
          serviceAccount:
            create: true
        memcached-blocks-metadata:
          enabled: true
          image:
            registry: redact
            repository: redact/memcached-bitnami
            tag: redact
          commonLabels:
            release: kube-prometheus-stack
          podManagementPolicy: OrderedReady
          metrics:
            enabled: true
            image:
              registry: redact
              repository: redact/memcached-exporter-bitnami
              tag: redact
            serviceMonitor:
              enabled: true
              relabelings:
                - sourceLabels: [__meta_kubernetes_pod_name]
                  targetLabel: instance
          resources:
            requests:
              memory: 1Gi
              cpu: 1
            limits:
              memory: 1.5Gi
              cpu: 1
          args:
            - /run.sh
            - -I 32m
          serviceAccount:
            create: true
        runtimeconfigmap:
          create: true
          annotations: {}
          runtime_config: {}

Quick PPROF of Store GW

curl -s http://localhost:8080/debug/pprof/heap > heap.out

go tool pprof heap.out

top

Showing nodes accounting for 622.47MB, 95.80% of 649.78MB total
Dropped 183 nodes (cum <= 3.25MB)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
  365.95MB 56.32% 56.32%   365.95MB 56.32%  github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init.func3
  127.94MB 19.69% 76.01%   528.48MB 81.33%  github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init
   76.30MB 11.74% 87.75%    76.30MB 11.74%  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor
   34.59MB  5.32% 93.07%    34.59MB  5.32%  github.com/prometheus/prometheus/tsdb/index.NewSymbols

CharlieTLe · 2024-10-13T20:45:47Z

Hi @dpericaxon,

Thanks for filing the issue.

I was looking at the pprof attached in the issue and noticed that LabelValues looked interesting.

github.com/thanos-io/thanos/pkg/block/indexheader.(*LazyBinaryReader).LabelValues

Something that changed between 1.17.1 and 1.18.0 is this:

[CHANGE] Ingester: Remove -querier.query-store-for-labels-enabled flag. Querying long-term store for labels is always enabled. Remove query_store_for_labels_enabled configuration #5984

I don't see this flag being set in your values file for the queriers, so it would not have been enabled before the upgrade:

            querier.query-ingesters-within: 8h
            querier.max-fetched-data-bytes-per-query: "2147483648"
            querier.max-fetched-chunks-per-query: "1000000"
            querier.max-fetched-series-per-query: "200000"
            querier.max-samples: "50000000"
            blocks-storage.bucket-store.bucket-index.enabled: true

I have a feeling that since it is always enabled, the label values are being returned for the entire time range instead of just the instant that the query was run.

Could you try setting querier.query-store-for-labels-enabled: true in 1.17.1 in your setup and see whether the issue happens?

alanprot · 2024-10-13T20:52:03Z

It could indeed be because of that flag. Good catch @CharlieTLe

Maybe we should default the series/label names APIs to query the last 24 hours if the time range is not specified?
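As a rough illustration of the defaulting idea above (not Cortex's implementation), the sketch below fills in a missing time range with a lookback window ending at the current time. The function name `resolve_time_range` and the constant `DEFAULT_LOOKBACK` are hypothetical; only the 24-hour value comes from the suggestion.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Assumed default taken from the suggestion in this thread.
DEFAULT_LOOKBACK = timedelta(hours=24)


def resolve_time_range(
    start: Optional[datetime],
    end: Optional[datetime],
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end), defaulting missing bounds to the last 24 hours."""
    if now is None:
        now = datetime.now(timezone.utc)
    resolved_end = end if end is not None else now
    # A missing start is anchored to the resolved end, not to "now".
    resolved_start = start if start is not None else resolved_end - DEFAULT_LOOKBACK
    if resolved_start > resolved_end:
        raise ValueError("start must not be after end")
    return resolved_start, resolved_end
```

With this shape, a labels request that carries no `start` or `end` would touch roughly one day of blocks, while callers that pass explicit bounds keep their current behavior.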

CharlieTLe · 2024-10-13T21:08:59Z

I think we should be able to set a limit for how many label values can be queried so that even if a long time range is specified, it doesn't cause the store-gateway to use too much memory.

alanprot · 2024-10-13T21:13:49Z

There is an effort to limit this, but it may not be straightforward, as this limit can only be applied after querying the index (and for those particular APIs, this is all the work).
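The point about the limit being applied after the index is read can be shown with a toy model. This is not how Thanos or Cortex read the index; `label_values_with_limit` is a hypothetical helper that only demonstrates that truncating after a full scan reduces the response size, not the work done.

```python
from typing import Iterable, List, Tuple


def label_values_with_limit(
    index_entries: Iterable[str], limit: int
) -> Tuple[List[str], int]:
    """Return (values truncated to `limit`, number of index entries read)."""
    seen = set()
    entries_read = 0
    for value in index_entries:
        # Every entry is read and deduplicated before the limit applies.
        entries_read += 1
        seen.add(value)
    return sorted(seen)[:limit], entries_read


values, entries_read = label_values_with_limit(
    (f"pod-{i % 1000}" for i in range(5000)), limit=10
)
print(len(values), entries_read)
```

In this model the caller gets 10 values back, but all 5000 entries were still scanned.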

CharlieTLe · 2024-10-13T21:37:14Z

Should we add the flag to restore the previous behavior until a limit can be set on the maximum number of label values that could be fetched? Or perhaps setting an execution time limit on the fetching so that it can be cancelled if it's taking longer than a specified duration?

I think this specific API call is mostly used by query builders to make autocomplete possible?
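A minimal sketch of the execution time limit idea, assuming label values arrive in batches. The names `collect_with_deadline` and `FetchDeadlineExceeded` are hypothetical and do not correspond to any Cortex or Thanos API.

```python
import time
from typing import Callable, Iterable, List


class FetchDeadlineExceeded(Exception):
    """Raised when label values could not be collected before the deadline."""


def collect_with_deadline(
    batches: Iterable[List[str]],
    timeout_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Collect batches of label values, aborting once the deadline has passed."""
    deadline = clock() + timeout_seconds
    values: List[str] = []
    for batch in batches:
        # The check runs after each batch is produced, so a single slow
        # batch is not interrupted midway.
        if clock() >= deadline:
            raise FetchDeadlineExceeded(
                f"gave up after {timeout_seconds}s with {len(values)} values"
            )
        values.extend(batch)
    return values
```

A deadline bounds how long a request runs, but it does not by itself bound how much memory is allocated before the deadline fires.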

yeya24 · 2024-10-13T22:03:54Z

I don't think the increase in heap usage was caused by label values requests. If you look at the heap profile, the memory was used by the binary index header, which is expected because the Store Gateway caches blocks' symbols and some postings. Also, the heap profile provided may not capture what took the memory, as it only accounts for about 600MB.

I recommend taking another heap dump from a Store Gateway where you observe high memory usage.

elliesaber · 2024-10-14T20:09:26Z

Thank you @CharlieTLe and @yeya24 for your suggestions

We first tried setting querier.query-store-for-labels-enabled: true in version 1.17.1. After making this change, we observed that the Store Gateway Pods started frequently encountering OOMKills, with both CPU and memory usage spiking far beyond their usual levels.
Since we were able to reproduce the issue with querier.query-store-for-labels-enabled: true, we decided to set it to false and then upgraded to v1.18.0. Unfortunately, even with querier.query-store-for-labels-enabled: false, the Store Gateway Pods continued encountering OOMKills, and CPU and memory usage spiked again.

CPU and memory spike after setting the flag to false and upgrading to v1.18.0

Here’s a quick PPROF of the Store Gateway during one of these OOM incidents:

(pprof) top
Showing nodes accounting for 975.10MB, 95.68% of 1019.09MB total
Dropped 206 nodes (cum <= 5.10MB)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  464.82MB 45.61% 45.61%   464.82MB 45.61%  github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init.func3
  178.29MB 17.50% 63.11%   683.35MB 67.05%  github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init
  129.10MB 12.67% 75.78%   129.10MB 12.67%  github.com/thanos-io/thanos/pkg/pool.NewBucketedBytes.func1
   76.84MB  7.54% 83.32%    76.84MB  7.54%  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor
   64.77MB  6.36% 89.67%    65.77MB  6.45%  github.com/bradfitz/gomemcache/memcache.parseGetResponse
   40.23MB  3.95% 93.62%    40.23MB  3.95%  github.com/prometheus/prometheus/tsdb/index.NewSymbols
   13.94MB  1.37% 94.99%    13.94MB  1.37%  github.com/klauspost/compress/s2.NewWriter.func1
    4.10MB   0.4% 95.39%   687.45MB 67.46%  github.com/thanos-io/thanos/pkg/block/indexheader.newFileBinaryReader
    1.50MB  0.15% 95.54%     5.55MB  0.54%  github.com/thanos-io/thanos/pkg/store.(*blockSeriesClient).nextBatch
    1.50MB  0.15% 95.68%    35.02MB  3.44%  github.com/thanos-io/thanos/pkg/store.populateChunk

CharlieTLe · 2024-10-14T21:01:45Z

Hi @elliesaber,

Unfortunately, setting querier.query-store-for-labels-enabled: false in v1.18.0 does not disable querying the store-gateway for labels since the flag was removed in #5984.

We could bring the flag back by reverting #5984. I'm not really sure why we decided to remove this flag instead of setting its default to true. Adding the flag back could help users who want to upgrade to 1.18.0 without querying the store gateway for labels.

elliesaber · 2024-10-14T21:15:02Z

Thank you @CharlieTLe for the suggestion.

I agree that being able to set querier.query-store-for-labels-enabled manually instead of relying on the default behavior would be helpful for us. Reverting the flag and allowing users to control whether or not to query the store gateway for labels would give us more flexibility. This would likely prevent the significant CPU and memory spikes that are leading to OOMKills and help smooth the upgrade process to v1.18.0. We’d appreciate this addition as it would enable us to upgrade without running into these memory issues.

yeya24 · 2024-10-14T21:28:26Z

(pprof) top
Showing nodes accounting for 975.10MB, 95.68% of 1019.09MB total
Dropped 206 nodes (cum <= 5.10MB)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  464.82MB 45.61% 45.61%   464.82MB 45.61%  github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init.func3
  178.29MB 17.50% 63.11%   683.35MB 67.05%  github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init
  129.10MB 12.67% 75.78%   129.10MB 12.67%  github.com/thanos-io/thanos/pkg/pool.NewBucketedBytes.func1
   76.84MB  7.54% 83.32%    76.84MB  7.54%  github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor
   64.77MB  6.36% 89.67%    65.77MB  6.45%  github.com/bradfitz/gomemcache/memcache.parseGetResponse
   40.23MB  3.95% 93.62%    40.23MB  3.95%  github.com/prometheus/prometheus/tsdb/index.NewSymbols
   13.94MB  1.37% 94.99%    13.94MB  1.37%  github.com/klauspost/compress/s2.NewWriter.func1
    4.10MB   0.4% 95.39%   687.45MB 67.46%  github.com/thanos-io/thanos/pkg/block/indexheader.newFileBinaryReader
    1.50MB  0.15% 95.54%     5.55MB  0.54%  github.com/thanos-io/thanos/pkg/store.(*blockSeriesClient).nextBatch
    1.50MB  0.15% 95.68%    35.02MB  3.44%  github.com/thanos-io/thanos/pkg/store.populateChunk

I don't think the heap dump above shows that the issue was label values requests hitting the store gateway. The heap dump was probably not taken at the right time, as your memory usage showed that it could go up to 48GB.
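To put the mismatch in numbers, here is a small sketch that reads the summary line of `top` output and compares the profile total with the observed peak. Treating both the pprof "MB" and the 48GB figure as 1024-based units is an assumption, and the helper names are hypothetical.

```python
import re

# Matches the "... of <size><unit> total" summary line printed by `top`.
_SUMMARY = re.compile(r"of\s+([\d.]+)(kB|MB|GB)\s+total")
_UNIT_TO_MB = {"kB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def profile_total_mb(top_output: str) -> float:
    """Return the profile total in MB from pprof `top` text output."""
    match = _SUMMARY.search(top_output)
    if match is None:
        raise ValueError("no 'of <size> total' summary line found")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_MB[unit]


def explained_fraction(top_output: str, observed_gb: float) -> float:
    """Fraction of the observed memory that the heap profile accounts for."""
    return profile_total_mb(top_output) / (observed_gb * 1024)


summary = "Showing nodes accounting for 975.10MB, 95.68% of 1019.09MB total"
print(f"{explained_fraction(summary, 48):.1%}")  # prints 2.1%
```

Under those assumptions the profile covers only about 2% of the peak, which is consistent with it having been captured outside a spike.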

For the memory usage metric, are you using the container_memory_working_set_bytes metric or the heap size metric?

Another thing that might help with the issue is setting GOMEMLIMIT. But we need to understand the root cause of the OOM kill first.
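If GOMEMLIMIT is tried, one common approach is to derive it from the container memory limit with some headroom. The sketch below assumes a 90% ratio, which is an illustrative choice and not a Cortex recommendation; the helper names are hypothetical, and only binary Kubernetes suffixes (Ki, Mi, Gi, Ti) or plain byte counts are handled.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def k8s_memory_to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '24Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    # No recognized suffix: treat the value as a plain byte count.
    return int(quantity)


def gomemlimit_for(memory_limit: str, headroom: float = 0.9) -> str:
    """Return a GOMEMLIMIT value (in MiB) below the container memory limit."""
    limit_bytes = k8s_memory_to_bytes(memory_limit)
    mib = int(limit_bytes * headroom) // 1024**2
    return f"{mib}MiB"


print(gomemlimit_for("24Gi"))  # prints 22118MiB
```

GOMEMLIMIT is a soft limit on memory managed by the Go runtime, so it can reduce garbage-collection slack but will not prevent an OOM kill if live data genuinely exceeds the container limit.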

CharlieTLe · 2024-10-15T02:11:05Z

This message seems pretty telling that it is caused by the behavior controlled by the flag querier.query-store-for-labels-enabled.

We first tried setting querier.query-store-for-labels-enabled: true in version 1.17.1. After making this change, we observed that the Store Gateway Pods started frequently encountering OOMKills, with both CPU and memory usage spiking far beyond their usual levels.

If we ignore the heap dump, it does seem possible that there is a label with very high cardinality. If there is no limit on how many label values can be queried, I could imagine the store-gateway being overwhelmed by fetching every possible value for a label.

elliesaber · 2024-10-15T16:11:13Z

@yeya24 we used container_memory_working_set_bytes in the graph screenshot you see.

yeya24 · 2024-10-27T20:00:51Z

Thanks, and sorry for the late response. @elliesaber What does the go_memstats_alloc_bytes metric look like? This is your heap size.

If you can confirm that the OOM kill was caused by the query-store-for-labels-enabled change, I think we can add the flag back, as its removal breaks the user experience.

dpericaxon · 2024-10-29T23:30:36Z

Hey @yeya24, we believe it's related to that flag. This is what go_memstats_alloc_bytes looked like for the different store-gateways. Let me know if the image below helps, or if you need more info or anything clearer!

yeya24 · 2024-10-30T01:43:27Z

@dpericaxon I don't think the graph showed that the flag is related. It looks more related to a deployment.

Do you have any API requests that ask for label names/values at the time of the spikes? The flag is related to the labels APIs, so we need evidence that those APIs caused the memory increase. You can reproduce this by calling the API manually yourself.
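For the manual reproduction, a sketch of building a label values request against the Prometheus-compatible API. The base URL `http://localhost:8080`, the `/prometheus` path prefix, and the label name `instance` are assumptions about the deployment; `label_values_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def label_values_url(
    base_url: str,
    label: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
    prefix: str = "/prometheus",  # assumed API prefix; adjust to your setup
) -> str:
    """Build a label values URL, with optional Unix-second time bounds."""
    path = f"{prefix}/api/v1/label/{quote(label, safe='')}/values"
    params = {}
    if start is not None:
        params["start"] = str(start)
    if end is not None:
        params["end"] = str(end)
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_url.rstrip('/')}{path}{query}"


# Without bounds, and with an explicit one-day window.
print(label_values_url("http://localhost:8080", "instance"))
print(label_values_url("http://localhost:8080", "instance", 1728950400, 1729036800))
```

Sending each URL with the tenant's `X-Scope-OrgID` header while watching go_memstats_alloc_bytes on the store-gateways should show whether the unbounded request is what drives the spike.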

dosubot bot added component/store-gateway type/bug labels Oct 10, 2024
