Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruler: add support for caching rule group contents #9386

Closed
4 of 5 tasks
56quarters opened this issue Sep 23, 2024 · 0 comments
Closed
4 of 5 tasks

Ruler: add support for caching rule group contents #9386

56quarters opened this issue Sep 23, 2024 · 0 comments
Assignees
Labels
component/ruler enhancement New feature or request

Comments

@56quarters
Copy link
Contributor

56quarters commented Sep 23, 2024

Is your feature request related to a problem? Please describe.

Currently, the contents of rule groups for tenants are not cached. This means an API call to return all configuration for rule groups must perform a list call for object storage (this is sometimes cached) and then N additional object storage requests. For tenants with a large number of rule groups, this can result in a huge number of calls.

Most rule group configurations don't change very often. This makes them a good candidate for caching. However when they are changed, that needs to be reflected immediately in API results. So if we want to add caching for rule groups we also need a way to invalidate their cached contents.

Describe the solution you'd like

We should:

  • Add functionality to CachingBucket to invalidate cached object storage contents on mutations. (PR)
  • Unify "direct" and "cached' ruler storage backends in the Ruler so that cache invalidation works correctly. (PR)
  • Add further cache invalidation logic to prevent data races. (#591, #9546, #601, #9575)
  • Configure ruler storage to use caching for Get operations (PR)
  • [optional] Increase the cache TTL for rule groups cached via Get operations to something longer than a single evaluation interval

Describe alternatives you've considered

Doing nothing is simpler and doesn't require opening up the can of worms that is cache invalidation.

@56quarters 56quarters added component/ruler enhancement New feature or request labels Sep 23, 2024
56quarters added a commit that referenced this issue Sep 23, 2024
Invalidate content, existence, and attributes cached for a particular object
when the object is modified or deleted.

Part of #9386
56quarters added a commit that referenced this issue Sep 24, 2024
Invalidate content, existence, and attributes cached for a particular object
when the object is modified or deleted.

Part of #9386
@56quarters 56quarters self-assigned this Sep 25, 2024
56quarters added a commit that referenced this issue Sep 26, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386
56quarters added a commit that referenced this issue Sep 26, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386
56quarters added a commit that referenced this issue Sep 27, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386
56quarters added a commit that referenced this issue Sep 30, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Sep 30, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 3, 2024
This change adds a synchronous version of `.Set` to Memcached and Redis
clients as well as the various `Cache` wrapper implementations. This
allows callers to set a key and be sure it exists in the cache. This
change also adds an `.Add` method which conditionally adds an item to
the cache only if it does not already exist.

This change is a prerequisite for grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 3, 2024
This change adds a synchronous version of `.Set` to Memcached and Redis
clients as well as the various `Cache` wrapper implementations. This
allows callers to set a key and be sure it exists in the cache. This
change also adds an `.Add` method which conditionally adds an item to
the cache only if it does not already exist.

This change is a prerequisite for grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 3, 2024
This change adds a synchronous version of `.Set` to Memcached and Redis
clients as well as the various `Cache` wrapper implementations. This
allows callers to set a key and be sure it exists in the cache. This
change also adds an `.Add` method which conditionally adds an item to
the cache only if it does not already exist.

This change is a prerequisite for grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 3, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 3, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 4, 2024
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safety cached with the understanding
that it is safe to cache them because they will correctly invalidated when
deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 7, 2024
* cache: Add `.Set` and `.Add` methods to cache clients

This change adds a synchronous version of `.Set` to Memcached and Redis
clients as well as the various `Cache` wrapper implementations. This
allows callers to set a key and be sure it exists in the cache. This
change also adds an `.Add` method which conditionally adds an item to
the cache only if it does not already exist.

This change is a prerequisite for grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Changelog

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review fixes

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Add test for .Add() method semantics for LRU cache

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Assert on cache contents for LRU `.Add()` test

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

---------

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 8, 2024
Add an `.Advance()` method to MockCache and InstrumentedMockCache to
allow the time considered "now" to be moved without needing to actually
sleep. This is useful for testing when items are set with a TTL and you
would like for them to actually expire as they would in a real cache.

Part of grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 8, 2024
Add an `.Advance()` method to MockCache and InstrumentedMockCache to
allow the time considered "now" to be moved without needing to actually
sleep. This is useful for testing when items are set with a TTL and you
would like for them to actually expire as they would in a real cache.

Part of grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this issue Oct 9, 2024
Add an `.Advance()` method to MockCache and InstrumentedMockCache to
allow the time considered "now" to be moved without needing to actually
sleep. This is useful for testing when items are set with a TTL and you
would like for them to actually expire as they would in a real cache.

Part of grafana/mimir#9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 9, 2024
This change makes use of `add` operations for special "lock" cache
entries to ensure that when items in object storage are mutated, stale
results are not immediately stored to cache again.

It does this by `set`ing "lock" cache entries with a short TTL when an
item in object storage is about to be mutated. This prevents reads of
the item from caching the results afterwards. After the item is mutated
in object storage, its cache entries (excluding the lock entries) are
deleted. After the lock entries expire, reads of the item are allowed
to store results in the cache again.

Part of #9386
56quarters added a commit that referenced this issue Oct 9, 2024
This change makes use of `add` operations for special "lock" cache
entries to ensure that when items in object storage are mutated, stale
results are not immediately stored to cache again.

It does this by `set`ing "lock" cache entries with a short TTL when an
item in object storage is about to be mutated. This prevents reads of
the item from caching the results afterwards. After the item is mutated
in object storage, its cache entries (excluding the lock entries) are
deleted. After the lock entries expire, reads of the item are allowed
to store results in the cache again.

Part of #9386
56quarters added a commit that referenced this issue Oct 9, 2024
This change makes use of `add` operations for special "lock" cache
entries to ensure that when items in object storage are mutated, stale
results are not immediately stored to cache again.

It does this by `set`ing "lock" cache entries with a short TTL when an
item in object storage is about to be mutated. This prevents reads of
the item from caching the results afterwards. After the item is mutated
in object storage, its cache entries (excluding the lock entries) are
deleted. After the lock entries expire, reads of the item are allowed
to store results in the cache again.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 11, 2024
* Add more robust cache invalidation to CachingBucket

This change makes use of `add` operations for special "lock" cache
entries to ensure that when items in object storage are mutated, stale
results are not immediately stored to cache again.

It does this by `set`ing "lock" cache entries with a short TTL when an
item in object storage is about to be mutated. This prevents reads of
the item from caching the results afterwards. After the item is mutated
in object storage, its cache entries (excluding the lock entries) are
deleted. After the lock entries expire, reads of the item are allowed
to store results in the cache again.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review feedback

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

---------

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 11, 2024
Add a new experimental flag to enable caching of rule group contents
using the rule store cache. Rule groups are cached using the same TTL
as rule group listings: one evaluation interval.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 15, 2024
Add a new experimental flag to enable caching of rule group contents
using the rule store cache. Rule groups are cached using the same TTL
as rule group listings.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 16, 2024
Add a new experimental flag to enable caching of rule group contents
using the rule store cache. Rule groups are cached using the same TTL
as rule group listings.

Part of #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 17, 2024
Exclude alerts from firing about cache "add" operations failing since
this is expected during normal operation.

Related #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 17, 2024
Exclude alerts from firing about cache "add" operations failing since
this is expected during normal operation.

Related #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Oct 17, 2024
* mixin: Exclude cache "add" operations from alerting

Exclude alerts from firing about cache "add" operations failing since
this is expected during normal operation.

Related #9386

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Build helm tests

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

---------

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this issue Nov 20, 2024
When caching rule groups, the base64 encoded name of the rule group is
used as part of the key for caching its contents. For long rule group
names, this can exceed the max key length of Memcached. To solve this
we use the sha256 of the object name in the cache key instead of the
object name itself.

Related #9386
56quarters added a commit that referenced this issue Nov 20, 2024
When caching rule groups, the base64 encoded name of the rule group is
used as part of the key for caching its contents. For long rule group
names, this can exceed the max key length of Memcached. To solve this
we use the sha256 of the object name in the cache key instead of the
object name itself.

Related #9386
56quarters added a commit that referenced this issue Nov 21, 2024
When caching rule groups, the base64 encoded name of the rule group is
used as part of the key for caching its contents. For long rule group
names, this can exceed the max key length of Memcached. To solve this
we use the sha256 of the object name in the cache key instead of the
object name itself.

Related #9386
56quarters added a commit that referenced this issue Nov 22, 2024
When caching rule groups, the base64 encoded name of the rule group is
used as part of the key for caching its contents. For long rule group
names, this can exceed the max key length of Memcached. To solve this
we use the sha256 of the object name in the cache key instead of the
object name itself.

Related #9386
56quarters added a commit that referenced this issue Nov 25, 2024
When caching rule groups, the base64 encoded name of the rule group is
used as part of the key for caching its contents. For long rule group
names, this can exceed the max key length of Memcached. To solve this
we use a cryptographic hash of the object name when the hashed version
of the name is shorter than the full name. This is the same approach
taken for postings in the store-gateway `indexcache`.

This has the added benefit of not invalidating most existing cache
entries when rolling out this change. With the hash function picked,
key generation is between 5 and 10 times slower than _not_ hashing
the key but still dramatically faster than a network operation.

Related #9386
56quarters added a commit that referenced this issue Nov 26, 2024
When caching rule groups, the base64 encoded name of the rule group is
used as part of the key for caching its contents. For long rule group
names, this can exceed the max key length of Memcached. To solve this
we use a cryptographic hash of the object name when the hashed version
of the name is shorter than the full name. This is the same approach
taken for postings in the store-gateway `indexcache`.

This has the added benefit of not invalidating most existing cache
entries when rolling out this change. With the hash function picked,
key generation is between 5 and 10 times slower than _not_ hashing
the key but still dramatically faster than a network operation.

Related #9386
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/ruler enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant