Downsampling of metric data after a certain period #1834
-
Hello everyone :) I've been using Grafana Mimir as a Prometheus data backend for a few days now and I'm very pleased with its performance. Now I have a question about data downsampling, because I didn't find anything about it in the documentation. As far as I understand, Mimir will store all data for an unlimited time as long as I don't configure a retention period. Is it possible to configure it so that after some time (for example, a week) the raw data is aggregated into (for example) one-hour data points, each represented by a min, max and average value, with the corresponding raw data deleted? It would be interesting to know if something like this is possible, because I don't need the raw data of every metric forever, but some trend metrics would be nice. For reports like "how many HTTP requests hit the load balancer two years ago" I don't need every 10s data point 😄. Thanks in advance for your help :) Kind regards,
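(Editor's note: a partial workaround exists today with plain recording rules. Below is a minimal sketch, assuming a hypothetical metric named `my_metric`, of what the hour-level min/max/avg series asked about here could look like. Recording rules only add new series; they cannot delete the raw data.)

```yaml
# Sketch only: approximating hour-level aggregates with standard
# Prometheus recording rules. `my_metric` is a hypothetical metric name.
# This creates new hourly series alongside the raw data; it does NOT
# delete the raw samples.
groups:
  - name: hourly_downsample
    interval: 1h # evaluate once per hour
    rules:
      - record: my_metric:min_over_time_1h
        expr: min_over_time(my_metric[1h])
      - record: my_metric:max_over_time_1h
        expr: max_over_time(my_metric[1h])
      - record: my_metric:avg_over_time_1h
        expr: avg_over_time(my_metric[1h])
```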
-
Hello! No, downsampling is not currently supported, and I'm not sure if there are plans to work on it in the future. Are you interested in this because long range queries are taking a long time? Or to save on storage costs? Or some other reason?
-
Thank you for your question. There are several problems with downsampling that need to be considered:

- When using an object store like GCS or S3, storage costs are typically only a small fraction of the cost of running Mimir, so the savings here may not be high.
- Downsampling is very IO intensive. Because it's not possible to modify TSDB blocks in place, downsampling requires that blocks are downloaded, rebuilt from scratch with downsampled series (possibly only some of them, based on configuration), and then uploaded back. The old blocks must then be deleted. All of this processing adds to the cost.
- Downsampling complicates querying. The PromQL query engine uses a single look-back period when looking for samples. Ty…

These are just random thoughts that we would need to take into consideration when designing a downsampling feature.
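(Editor's note: to make the look-back point in the third bullet concrete, here is a minimal sketch, assuming a hypothetical counter `http_requests_total` scraped every 10s. A fixed range selector that works fine against raw data selects nothing once samples are an hour apart.)

```yaml
# Illustration only: the same range selector behaves differently against
# raw vs. downsampled data. `http_requests_total` is hypothetical.
groups:
  - name: lookback_illustration
    rules:
      # Against raw 10s scrapes, a 5m window holds ~30 samples; rate() works.
      - record: job:http_requests:rate5m
        expr: rate(http_requests_total[5m])
      # Against series downsampled to 1 sample per hour, a 5m window is
      # empty and this rule records nothing; the window must be widened
      # to cover at least two samples, e.g. [2h].
      - record: job:http_requests:rate2h
        expr: rate(http_requests_total[2h])
```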
-
I don't care about the storage cost, but I do care about the user experience for long time-range queries. The main goal of downsampling is to provide fast results for long range queries spanning months or years.
-
As discussed on Slack, I would like to share my use cases. As users, we would like to query a wide range of series. The full resolution is not mandatory; however, when running a wide range query, the response time impacts the user experience.

```yaml
# cluster level configuration
compactor:
  downsampling:
    - 1d:1m # After 1d, apply downsampling and keep 1 sample per minute
    - 2d:5m
    - 2w:1h
```

```yaml
# runtime configuration
overrides:
  tenant1:
    downsampling:
      - 1d:1m # After 1d, apply downsampling and keep 1 sample per minute
      - 5d:5m
      - 4w:1h
```

We could even think of downsampling a subset of the series associated with a tenant differently, based on a regex:

```yaml
overrides:
  tenant1:
    downsampling:
      ".*":
        - 1d:1m # After 1d, apply downsampling and keep 1 sample per minute
        - 5d:5m
        - 4w:1h
      "cpu.*":
        - 1d:1m # After 1d, apply downsampling and keep 1 sample per minute
        - 4w:1h
```

In some cases, once downsampled, the full resolution might no longer be needed.

### Examples

**High frequency sampling**

End users want to sample at 1 Hz, but the full resolution is needed only for a short period of time.

**Capacity planning**

Keep low resolution data for capacity planning. Indeed, the more data we have from the past, the more accurate the forecasting is. The use case could be several years (2 to 5 years).

**Prune full resolution**

Only keep the downsampled data after a pre-defined period of time.

### Additional context

Side note: this would help users migrate from other backends without feature loss.
-
Hello, I must choose a new metric storage solution and I would like to use downsampling. Are there plans to add this to Mimir, or must I go with Thanos? Thank you.
-
I reckon performance on multi-month/multi-year queries is more important than the storage cost; object storage pricing is generally low compared to compute pricing. One alternative that tackles both performance and cost is to have downsampling plus different retention per series. That way you could keep the downsampled metrics longer than the raw metrics. This is not possible at the moment, as all metrics of a tenant are kept for the same amount of time.
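(Editor's note: for reference, per-tenant retention is configured through the runtime overrides today. Below is a minimal sketch; `compactor_blocks_retention_period` is, to the best of my knowledge, a real Mimir per-tenant limit, while the commented-out per-resolution split is purely hypothetical.)

```yaml
overrides:
  tenant1:
    # Existing per-tenant limit: a single retention period applies to
    # ALL of the tenant's series alike.
    compactor_blocks_retention_period: 1y
    # Hypothetical extension (does not exist in Mimir today): retain
    # downsampled resolutions longer than the raw samples.
    # retention_by_resolution:
    #   raw: 30d
    #   1h: 5y
```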
-
That's exactly what we need. In our case, we don't care about precise metrics over time; we need to see the trends. We would like to have high resolution metrics for a short range like 1 month, medium resolution metrics after that for around 6 months, and low resolution metrics for several years.
-
Is it maybe worth opening an issue for this, since there is appetite from the community? I too have this requirement, and for the same reasons as those above.
-
I've been thinking about a proposal for a long time. Here is what I've come up with. Be advised: it is not as easy as it looks.