Monitoring cache performance using Sentry #68265
-
Oh this is really good @bcoe. Congratulations on joining Sentry, by the way. I recently started adopting caching intentionally in my apps, and I've really been thinking, "Ain't this a black box in terms of monitoring?" Like, how do I know if there was a cache hit or a cache miss, or if something is awry with my entire cache setup with Sails Stash? So cache monitoring from Sentry will definitely be amazing for me to look into "What's up with my cache".

My cache uses Redis as the cache backend in the Sails framework. Currently in Hagfish, the only application-level cache I have is on the count of sent invoices, which I cache for a week. I also plan on using Sentry for Sailscasts, which caches the courses on the courses page for a month.

I like the cache monitoring mockups, and having the granularity of insights into cache issues be on transactions is fine as well. Again, I currently use Sails Stash, which provides a cache abstraction that currently supports Redis by using the sails-redis Waterline adapter, and I'll be keen to see how this will work. Great job with this one @bcoe
-
Hey Ben! Super excited to see that y'all are thinking about this. I spend as much time as I can working with Sentry to improve our monitoring, mostly with an eye toward performance. Below is what I often think about re: caching that I think Sentry might be able to help with.

Current state

We already have Sentry deployed in both our Vue.js FE and Node.js BE. We use the vue-router + express integrations to track page loads and API calls as transactions. In many of our APIs, we use a Redis cache both to reduce redundant compute and to speed up certain DB queries that might otherwise be heavier than desired.

Our Redis layer is custom: we can wrap a given async function with a caching utility that works essentially like lodash memoize. Every call to the function first attempts to read a value from cache; otherwise we "read through" by calling the underlying function (which likely makes a DB call and/or does some heavy compute on some data) and write the value to cache for next time.

I've written a Sentry wrapper around our underlying Redis utils that tracks calls to Redis as spans within the API call transactions. This is helpful for debugging individual transactions, but is borderline useless for trying to understand the impact of caching across large numbers of transactions. Any given API call likely makes an average of 1.5–2 cache reads for us. For example, we might have endpoint

Questions I'm interested in

For the most part, I think there are two ways to think about cache health + performance: either from the perspective of a logical cache (e.g. accesses to + health of the auth data cache across all APIs we have) or from the perspective of a given transaction (e.g. the

Here are the types of things I'm interested in, in roughly decreasing order of importance:

[1] [Transaction Perspective] What is the difference in total endpoint performance of
[2] [Transaction Perspective] I'm also very interested in the impact of caches on the perf of higher-order transactions. e.g. Page load X makes calls to a dozen endpoints, many of which may rely on cache reads. How often does Page X actually have a happy-path load where most/all of the APIs it calls hit warm caches? Ideally, aggregate cache hit/miss information would be visible up the full chain of transactions, not just on the immediate parent of the cache read. [3] [Cache Perspective] I am also interested in the health of a cache from its own perspective: Did our auth data cache get slower across the board recently? Did the cost of a
[4] [Cache Perspective] What's the average value size (in bytes) of values read from/written to X cache? How does the performance of reads + writes scale with different value sizes?

Other thoughts
I think the mocks you've posted so far are generally thinking about the right kinds of metrics re: caches, but they appear to be very heavily focused on the cache perspective rather than the "usage of cache X within transaction Y" perspective. While that is itself quite useful, I'd say 90–95% of the time I'm thinking about caches from the perspective of their impact on a given transaction (or chain of transactions, like a full page load), not in isolation.

In general, this is a very greenfield wishlist post: I'm hoping to cover the shape of my overall cache thinking so you can pick and choose what you think is most important, or where Sentry is best positioned to help. I'm more than happy to have any follow-up chats that you might want here. Like I said, I'm very interested in something like this having first-class support in Sentry.
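For context, the read-through wrapper I described above looks roughly like this. This is a minimal sketch, not our actual code: the names (`cacheThrough`, `spans`) are made up, a `Map` stands in for Redis, and a plain array of span-like records stands in for the Sentry span wrapper.

```javascript
// Sketch of a lodash-memoize-style read-through cache wrapper.
// Hypothetical names throughout; a Map stands in for Redis, and
// the spans array stands in for Sentry span instrumentation.
const store = new Map();
const spans = [];

function cacheThrough(keyPrefix, fn, ttlMs = 60_000) {
  return async (...args) => {
    const cacheKey = `${keyPrefix}:${JSON.stringify(args)}`;
    const entry = store.get(cacheKey);
    if (entry && entry.expiresAt > Date.now()) {
      // Cache hit: record a span-like event and return the cached value.
      spans.push({ op: 'cache.get', key: cacheKey, hit: true });
      return entry.value;
    }
    // Cache miss: "read through" to the underlying function
    // (the DB call / heavy compute), then write back for next time.
    spans.push({ op: 'cache.get', key: cacheKey, hit: false });
    const value = await fn(...args);
    store.set(cacheKey, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}
```

Wrapping a heavy async function is then just `const cachedLookup = cacheThrough('user', lookupUser)`; the first call records a miss and populates the store, subsequent calls within the TTL record hits.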
-
Hi! I’m Ben, I recently joined Sentry as a Product Manager working on Performance Monitoring. 👋
I’d like to get in the habit of sharing features we’re exploring early, so that our most engaged users (✨ You! ✨) can help shape the design.
With this goal in mind, I’m seeking feedback on Cache performance monitoring…
Cache performance monitoring
Cache monitoring will be similar to query monitoring except, instead of queries, it provides insights into your application's cache behaviour.
Here are some cache performance questions we hope to help developers answer:
We landed on these use cases initially because they came to mind as real-world application performance regressions that Sentry can help identify and fix.
Request for feedback
Some questions to help kick off this conversation:
Mockups
Cache overview page
This page serves as a starting point for digging into specific cache performance issues.
Perhaps you’ve noticed that requests hitting an endpoint configured to use Django’s cache framework are occasionally slow. Starting on the Cache Overview page, you can check whether the endpoint in question has a higher-than-expected Miss % (across all cache reads within the transaction). From there, you can click the transaction itself for details about its corresponding spans.
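To make the Miss % rollup concrete, here is an illustrative aggregation over span-like records. The record shape (`{ transaction, hit }`) and the function name are assumptions for this sketch, not Sentry's actual span schema.

```javascript
// Illustrative per-endpoint cache Miss % rollup, computed from
// span-like records. The { transaction, hit } shape is an assumption
// for this sketch, not Sentry's actual span schema.
function missRateByTransaction(cacheSpans) {
  const stats = new Map();
  for (const { transaction, hit } of cacheSpans) {
    const s = stats.get(transaction) ?? { reads: 0, misses: 0 };
    s.reads += 1;
    if (!hit) s.misses += 1;
    stats.set(transaction, s);
  }
  // Report Miss % per transaction, rounded to whole percentage points.
  const result = {};
  for (const [name, s] of stats) {
    result[name] = Math.round((s.misses / s.reads) * 100);
  }
  return result;
}
```

An endpoint with three hits and one miss across its cache reads would report a 25% Miss %, which is the kind of per-transaction number the overview page surfaces.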
Transaction overlay
The Transaction overlay allows you to dig into cache performance issues tied to a specific transaction:
Looking forward to people’s feedback in this discussion.
Alternatively, if you’d rather reach out by email, you can find it here
— @bcoe