
Caffeine (async) cache seems to ignore 'maximum-size' in a certain scenario #43382

Closed
israelstmz opened this issue Sep 19, 2024 · 9 comments
Labels
area/cache kind/bug Something isn't working triage/needs-feedback We are waiting for feedback.

Comments

@israelstmz

Describe the bug

We have multiple local Caffeine caches in our web service. Per web request, our application performs 6 external API calls, and we use 6 caches to cache the web client responses (Uni).
Usage: @CacheResult(cacheName = "helper-service", lockTimeout = 20)
Config:

quarkus:
  cache:
    caffeine:
      "helper-service":
        initial-capacity: 50000
        maximum-size: 50000
        expire-after-write: 60m
        metrics-enabled: true

(we have 6 of those)
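For context, here is a minimal sketch of what one of the six cached calls looks like (the class name, method, and return type below are placeholders, not our real code):

import io.quarkus.cache.CacheResult;
import io.smallrye.mutiny.Uni;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class HelperServiceClient {

    // The result of this method is cached in the "helper-service" cache, keyed by the argument
    @CacheResult(cacheName = "helper-service", lockTimeout = 20)
    public Uni<String> fetch(String key) {
        // Placeholder for the real non-blocking web-client call
        return Uni.createFrom().item(() -> "response-for-" + key);
    }
}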
Out of dozens of app instances, we noticed a few of them crashing with OOM. Analysis led us to a growing cache. In the cache metrics (Prometheus) we can see that some caches exceed the 50K limit and keep rising way above 200K:

# TYPE cache_size gauge
# HELP cache_size The number of entries in this cache. This may be an approximation, depending on the type of cache.
cache_size{cache="helper-service"} 213754.0
cache_size{cache="support-service"} 1513.0

These are the cache size metrics:
[screenshot: cache_size metrics over time]
As can be seen, only one cache is increasing dramatically. It is not a specific one; each time it is a different cache.
Over time this fills the old-generation heap space until GC becomes inefficient, causing an OOM:
[screenshot: heap usage over time]

Our traffic per app instance is ~30 requests per second, with occasional spikes doubling that amount.
We even added a Quarkus scheduler that manually invalidates all the caches once an hour:

import io.quarkus.cache.Cache;
import io.quarkus.cache.CacheManager;
import io.quarkus.cache.CaffeineCache;
import io.quarkus.scheduler.Scheduled;
import jakarta.inject.Inject;
import org.eclipse.microprofile.config.inject.ConfigProperty;

@Inject
CacheManager cacheManager;

@ConfigProperty(name = "quarkus.cache.caffeine.maximum-size")
long MAX_CACHE_SIZE;

@Scheduled(every = "1h")
public void cleanCachesIfExceeded() {
    for (String cacheName : cacheManager.getCacheNames()) {
        cacheManager.getCache(cacheName)
                .ifPresent(this::checkAndClearCache);
    }
}

private void checkAndClearCache(Cache cache) {
    long cacheSize = getCacheSize(cache);
    if (cacheSize > MAX_CACHE_SIZE) {
        // invalidateAll() returns a Uni, so subscribing is required to trigger it
        cache.invalidateAll()
                .subscribe().asCompletionStage();
    }
}

private long getCacheSize(Cache cache) {
    if (cache instanceof CaffeineCache caffeineCache) {
        return caffeineCache.keySet().size();
    }
    return 0;
}

But this, too, only cleans a few of the caches some of the time; occasionally a certain cache still piles up and exhausts the memory.
We are running a native image.

Expected behavior

We expect the cache size to stay within/around the 'maximum-size' limit consistently.

Actual behavior

On some seemingly random app instances, an occasional cache grows without bound until an OOM occurs.

How to Reproduce?

Unfortunately, we weren't able to reproduce this, although we tried. It might be related to the high throughput/spikes in our production environment. As stated, even in production it only happens on a few pods out of dozens.

Output of uname -a or ver

Linux 5.10.215-203.850.amzn2.x86_64 x86_64

Output of java -version

quarkus-micro-image:2.0

Quarkus version or git rev

3.14.4

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

@israelstmz israelstmz added the kind/bug Something isn't working label Sep 19, 2024

quarkus-bot bot commented Sep 19, 2024

/cc @gwenneg (cache)

@geoand
Contributor

geoand commented Sep 19, 2024

Any chance you can put together a sample that reproduces the problem?

If not, more details on how you use Quarkus' caching would be great.

Thanks

@gsmet
Member

gsmet commented Oct 1, 2024

Yes, we would at least need to know the cache implementation you are using (Caffeine, Redis, Infinispan? I just saw you mentioned Caffeine in the title) and the configuration of the caches (you can anonymize the names if they contain private info).

Also, whether it's always the same caches causing problems might be useful information.

@gsmet gsmet added the triage/needs-feedback We are waiting for feedback. label Oct 1, 2024
@gsmet
Member

gsmet commented Oct 1, 2024

I had a look at our cache code and I don't see us doing anything obviously crazy; we mostly rely on Caffeine's behavior.

Could you by any chance check a few things when the problem happens:

  • check that all the caches are properly configured with their respective maximum size
  • when your scheduled task kicks in, could you log the status of the caches (configured max size, effective size, and whether the task is actually doing something) - something along the lines of the sketch below? What bugs me a bit is that you seem to be using a global config property for the cache size in this scheduled task, while the config at the beginning configures each cache's max size individually.
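Just a sketch of the kind of logging I mean (it keeps your single global maximum-size property for illustration; logging each cache's own configured limit would be even better; Log is io.quarkus.logging.Log):

import io.quarkus.cache.Cache;
import io.quarkus.cache.CacheManager;
import io.quarkus.cache.CaffeineCache;
import io.quarkus.logging.Log;
import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.config.inject.ConfigProperty;

@ApplicationScoped
public class CacheStatusLogger {

    @Inject
    CacheManager cacheManager;

    @ConfigProperty(name = "quarkus.cache.caffeine.maximum-size")
    long maxCacheSize;

    @Scheduled(every = "1h")
    public void logAndCleanCaches() {
        for (String cacheName : cacheManager.getCacheNames()) {
            cacheManager.getCache(cacheName).ifPresent(this::logAndMaybeClear);
        }
    }

    private void logAndMaybeClear(Cache cache) {
        // Effective size is only available for Caffeine-backed caches
        long effectiveSize = (cache instanceof CaffeineCache caffeineCache)
                ? caffeineCache.keySet().size() : -1;
        boolean clearing = effectiveSize > maxCacheSize;
        Log.infof("cache=%s effectiveSize=%d configuredMax=%d clearing=%b",
                cache.getName(), effectiveSize, maxCacheSize, clearing);
        if (clearing) {
            cache.invalidateAll().subscribe().asCompletionStage();
        }
    }
}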

Anyway, if we can't get a reproducer, we will need to try to diagnose more precisely what's going on and if it's Quarkus misconfiguring Caffeine or a Caffeine issue.

Probably also a good idea to check that setMaximumSize is never called on the caches (yeah, I know, but sometimes we forget about some code we added).

@ben-manes

It's probably also good to capture a thread dump (jstack) or an application profile (JFR). It sounds like the futures never complete, so they pile up over time, which implies the executor is starved by blocking calls. The default is ForkJoinPool.commonPool(), which is often abused by CompletableFuture or parallel streams that block indefinitely, e.g. due to a lack of a network timeout (like -Dsun.net.client.defaultConnectTimeout=5000), so there are no threads available to make progress.
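For example, something like this could be logged alongside the cache metrics to spot common-pool saturation (a rough diagnostic sketch using only standard JDK APIs):

import java.util.concurrent.ForkJoinPool;

public final class CommonPoolProbe {

    // If "active" stays pinned at the parallelism while the queued counts keep
    // growing, the pool is saturated and async cache work cannot make progress.
    public static String snapshot() {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        return "commonPool parallelism=" + pool.getParallelism()
                + " poolSize=" + pool.getPoolSize()
                + " active=" + pool.getActiveThreadCount()
                + " queuedSubmissions=" + pool.getQueuedSubmissionCount()
                + " queuedTasks=" + pool.getQueuedTaskCount();
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}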

@vldnikitin

vldnikitin commented Oct 11, 2024

We’re experiencing the same issue with some caches growing indefinitely in size on random pods.
Quarkus: 3.8.5
Cache configuration:

quarkus:
  cache:
    caffeine:
      "my-cache":
        initial-capacity: 50
        maximum-size: 500
        expire-after-write: 60S
        metrics-enabled: true

[screenshot: cache size metrics across pods]
It’s interesting that other metrics like ‘invalidations,’ ‘cache puts,’ and ‘cache gets’ look normal, with no particular pod standing out. However, it seems like some entries remain in the cache and don’t get invalidated by the configured ‘expire-after-write’ property.
Before that we used Quarkus 2.16 and didn't notice any problems with the same configuration.

@gsmet
Member

gsmet commented Oct 11, 2024

See my comment and @ben-manes's comment above. We will need a lot more information to be able to track down what's going on.

@geoand
Contributor

geoand commented Oct 30, 2024

@israelstmz any updates based on the comments above?

@geoand
Contributor

geoand commented Nov 11, 2024

Closing for lack of feedback.

@geoand geoand closed this as not planned Nov 11, 2024