
Caffeine (async) cache seems to ignore 'maximum-size' in a certain scenario #43382

Closed
israelstmz opened this issue Sep 19, 2024 · 9 comments
Labels
area/cache kind/bug Something isn't working triage/needs-feedback We are waiting for feedback.

Comments

@israelstmz

Describe the bug

We have multiple local Caffeine caches in our web service. Per web request, our application performs 6 external API calls, and we use 6 caches to cache the web client responses (Uni).
Usage: @CacheResult(cacheName = "helper-service", lockTimeout = 20)
Config:

quarkus:
  cache:
    caffeine:
      "helper-service":
        initial-capacity: 50000
        maximum-size: 50000
        expire-after-write: 60m
        metrics-enabled: true

(we have 6 of those)
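For context, here is a minimal sketch of what one of the six cached calls looks like (the class name, method, and return type below are placeholders, not our real code):

import io.quarkus.cache.CacheResult;
import io.smallrye.mutiny.Uni;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class HelperServiceClient {

    // The result of this method is cached in the "helper-service" cache, keyed by the argument
    @CacheResult(cacheName = "helper-service", lockTimeout = 20)
    public Uni<String> fetch(String key) {
        // Placeholder for the real non-blocking web-client call
        return Uni.createFrom().item(() -> "response-for-" + key);
    }
}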
Out of dozens of app instances, we noticed a few of them crashing with OOM. Analysis led us to a growing cache. In the cache metrics (Prometheus) we can see that some caches exceed the 50K limit and keep rising way above 200K:

# TYPE cache_size gauge
# HELP cache_size The number of entries in this cache. This may be an approximation, depending on the type of cache.
cache_size{cache="helper-service"} 213754.0
cache_size{cache="support-service"} 1513.0

These are the cache size metrics:
[screenshot: cache_size metrics over time]
As can be seen, only one cache is increasing dramatically. It is not a specific one; each time it is a different cache.
Over time this fills the old-generation heap space until GC becomes inefficient, causing an OOM:
[screenshot: heap usage over time]

Our traffic per app instance is ~30 requests per second, with occasional spikes doubling that amount.
We even added a Quarkus scheduler that manually invalidates all the caches once an hour:

import io.quarkus.cache.Cache;
import io.quarkus.cache.CacheManager;
import io.quarkus.cache.CaffeineCache;
import io.quarkus.scheduler.Scheduled;
import jakarta.inject.Inject;
import org.eclipse.microprofile.config.inject.ConfigProperty;

@Inject
CacheManager cacheManager;

@ConfigProperty(name = "quarkus.cache.caffeine.maximum-size")
long MAX_CACHE_SIZE;

@Scheduled(every = "1h")
public void cleanCachesIfExceeded() {
    for (String cacheName : cacheManager.getCacheNames()) {
        cacheManager.getCache(cacheName)
                .ifPresent(this::checkAndClearCache);
    }
}

private void checkAndClearCache(Cache cache) {
    long cacheSize = getCacheSize(cache);
    if (cacheSize > MAX_CACHE_SIZE) {
        // invalidateAll() returns a Uni, so subscribing is required to trigger it
        cache.invalidateAll()
                .subscribe().asCompletionStage();
    }
}

private long getCacheSize(Cache cache) {
    if (cache instanceof CaffeineCache caffeineCache) {
        return caffeineCache.keySet().size();
    }
    return 0;
}

But this, too, only cleans a few of the caches some of the time; occasionally a certain cache still piles up and exhausts the memory.
We are running a native image.

Expected behavior

We expect the cache size to stay within/around the 'maximum-size' limit consistently.

Actual behavior

On some seemingly random app instances, an occasional cache grows without bound until an OOM occurs.

How to Reproduce?

Unfortunately, we weren't able to reproduce this, although we tried. It might be related to the high throughput/spikes in our production environment. As stated, even in production it only happens on a few pods out of dozens.

Output of uname -a or ver

Linux 5.10.215-203.850.amzn2.x86_64 x86_64

Output of java -version

quarkus-micro-image:2.0

Quarkus version or git rev

3.14.4

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

@israelstmz israelstmz added the kind/bug Something isn't working label Sep 19, 2024

quarkus-bot bot commented Sep 19, 2024

/cc @gwenneg (cache)

@geoand
Contributor

geoand commented Sep 19, 2024

Any chance you can put together a sample that reproduces the problem?

If not, more details on how you use Quarkus' caching would be great.

Thanks

@gsmet
Member

gsmet commented Oct 1, 2024

Yes, we would at least need to know the cache implementation you are using (Caffeine, Redis, Infinispan? I just saw you mentioned Caffeine in the title) and the configuration of the caches (you can anonymize the names if they contain private info).

Also, whether it's always the same caches causing problems might be useful information.

@gsmet gsmet added the triage/needs-feedback We are waiting for feedback. label Oct 1, 2024
@gsmet
Member

gsmet commented Oct 1, 2024

I had a look at our cache code and I don't see us doing anything obviously crazy; we mostly rely on Caffeine's behavior.

Could you by any chance check a few things when the problem happens:

  • check that all the caches are properly configured with their respective maximum size
  • when your scheduled task kicks in, could you log the status of the caches (configured max size, effective size, and whether the task is actually doing something) - something along the lines of the sketch below? What bugs me a bit is that you seem to be using a global config property for the cache size in this scheduled task, while the config at the beginning configures each cache's max size individually.
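Just a sketch of the kind of logging I mean (it keeps your single global maximum-size property for illustration; logging each cache's own configured limit would be even better; Log is io.quarkus.logging.Log):

import io.quarkus.cache.Cache;
import io.quarkus.cache.CacheManager;
import io.quarkus.cache.CaffeineCache;
import io.quarkus.logging.Log;
import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.config.inject.ConfigProperty;

@ApplicationScoped
public class CacheStatusLogger {

    @Inject
    CacheManager cacheManager;

    @ConfigProperty(name = "quarkus.cache.caffeine.maximum-size")
    long maxCacheSize;

    @Scheduled(every = "1h")
    public void logAndCleanCaches() {
        for (String cacheName : cacheManager.getCacheNames()) {
            cacheManager.getCache(cacheName).ifPresent(this::logAndMaybeClear);
        }
    }

    private void logAndMaybeClear(Cache cache) {
        // Effective size is only available for Caffeine-backed caches
        long effectiveSize = (cache instanceof CaffeineCache caffeineCache)
                ? caffeineCache.keySet().size() : -1;
        boolean clearing = effectiveSize > maxCacheSize;
        Log.infof("cache=%s effectiveSize=%d configuredMax=%d clearing=%b",
                cache.getName(), effectiveSize, maxCacheSize, clearing);
        if (clearing) {
            cache.invalidateAll().subscribe().asCompletionStage();
        }
    }
}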

Anyway, if we can't get a reproducer, we will need to try to diagnose more precisely what's going on and if it's Quarkus misconfiguring Caffeine or a Caffeine issue.

Probably also a good idea to check that setMaximumSize is never called on the caches (yeah, I know, but sometimes we forget about some code we added).

@ben-manes

It's probably also good to capture a thread dump (jstack) or an application profile (JFR). It sounds like the futures never complete, so they pile up over time, which implies the executor is starved by blocking calls. The default is ForkJoinPool.commonPool(), which is often abused by CompletableFuture or parallel streams that block indefinitely, e.g. due to a lack of a network timeout (like -Dsun.net.client.defaultConnectTimeout=5000), so there are no threads available to make progress.
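For example, something like this could be logged alongside the cache metrics to spot common-pool saturation (a rough diagnostic sketch using only standard JDK APIs):

import java.util.concurrent.ForkJoinPool;

public final class CommonPoolProbe {

    // If "active" stays pinned at the parallelism while the queued counts keep
    // growing, the pool is saturated and async cache work cannot make progress.
    public static String snapshot() {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        return "commonPool parallelism=" + pool.getParallelism()
                + " poolSize=" + pool.getPoolSize()
                + " active=" + pool.getActiveThreadCount()
                + " queuedSubmissions=" + pool.getQueuedSubmissionCount()
                + " queuedTasks=" + pool.getQueuedTaskCount();
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}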

@vldnikitin

vldnikitin commented Oct 11, 2024

We’re experiencing the same issue with some caches growing indefinitely in size on random pods.
Quarkus: 3.8.5
Cache configuration:

quarkus:
  cache:
    caffeine:
      "my-cache":
        initial-capacity: 50
        maximum-size: 500
        expire-after-write: 60S
        metrics-enabled: true

[screenshot: cache size metrics across pods]
It’s interesting that other metrics like ‘invalidations,’ ‘cache puts,’ and ‘cache gets’ look normal, with no particular pod standing out. However, it seems like some entries remain in the cache and don’t get invalidated by the configured ‘expire-after-write’ property.
Before that we used Quarkus 2.16 and didn't notice any problems with the same configuration.

@gsmet
Member

gsmet commented Oct 11, 2024

See my comment and @ben-manes's comment above. We will need a lot more information to be able to track down what's going on.

@geoand
Contributor

geoand commented Oct 30, 2024

@israelstmz any updates based on the comments above?

@geoand
Contributor

geoand commented Nov 11, 2024

Closing for lack of feedback.

@geoand geoand closed this as not planned Nov 11, 2024