Kubernetes client native integration test OOMs (GC overhead limit reached) with -Xmx5g #37142

jerboaa · 2023-11-16T11:02:10Z

Describe the bug

Since today in mandrel CI, the kubernetes-client native integration test OOMs with a GraalVM master build:

[INFO] --- quarkus:999-SNAPSHOT:build (default) @ quarkus-integration-test-kubernetes-client ---
Warning:  [io.quarkus.deployment.steps.NativeImageAllowIncompleteClasspathAggregateStep] The following extensions have required native-image to allow run-time resolution of classes: {quarkus-kubernetes-client}. This is a global requirement which might have unexpected effects on other extensions as well, and is a hint of the library needing some additional refactoring to better support GraalVM native-image. In the case of 3rd party dependencies and/or proprietary code there is not much we can do - please ask for support to your library vendor. If you incur in any problem with other Quarkus extensions, please try reproducing the problem without these extensions first.
Warning:  [io.quarkus.deployment.steps.ReflectiveHierarchyStep] Unable to properly register the hierarchy of the following classes for reflection as they are not in the Jandex index:
	- io.fabric8.openshift.api.model.operator.v1.GenerationStatus (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
	- io.fabric8.openshift.api.model.operator.v1.OperatorCondition (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
Consider adding them to the index either by creating a Jandex index for your dependency via the Maven plugin, an empty META-INF/beans.xml or quarkus.index-dependency properties.
[INFO] [io.quarkus.deployment.pkg.steps.JarResultBuildStep] Building native image source jar: /home/runner/work/mandrel/mandrel/quarkus/integration-tests/kubernetes-client/target/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-native-image-source-jar/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner.jar
[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildStep] Building native image from /home/runner/work/mandrel/mandrel/quarkus/integration-tests/kubernetes-client/target/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-native-image-source-jar/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner.jar
[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildStep] Running Quarkus native-image plugin on GRAALVM 24.0-dev JDK 22+22-jvmci-b01
[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildRunner] /home/runner/work/mandrel/mandrel/graalvm-home/bin/native-image -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Dlogging.initial-configurator.min-level=500 -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=3 -J-Duser.language=en -J-Duser.country=US -J-Dfile.encoding=UTF-8 --features=io.quarkus.runner.Feature,io.quarkus.runtime.graal.DisableLoggingFeature -J--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED -J--add-opens=java.base/java.text=ALL-UNNAMED -J--add-opens=java.base/java.io=ALL-UNNAMED -J--add-opens=java.base/java.lang.invoke=ALL-UNNAMED -J--add-opens=java.base/java.util=ALL-UNNAMED -H:+UnlockExperimentalVMOptions -H:BuildOutputJSONFile=quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner-build-output-stats.json -H:-UnlockExperimentalVMOptions -H:+UnlockExperimentalVMOptions -H:+AllowFoldMethods -H:-UnlockExperimentalVMOptions -J-Djava.awt.headless=true --no-fallback -H:+UnlockExperimentalVMOptions -H:+ReportExceptionStackTraces -H:-UnlockExperimentalVMOptions -J-Xmx5g -H:-AddAllCharsets --enable-url-protocols=http,https -H:NativeLinkerOption=-no-pie --enable-monitoring=heapdump -H:+UnlockExperimentalVMOptions -H:-UseServiceLoaderFeature -H:-UnlockExperimentalVMOptions -J--add-exports=org.graalvm.nativeimage/org.graalvm.nativeimage.impl=ALL-UNNAMED --exclude-config io\.netty\.netty-codec /META-INF/native-image/io\.netty/netty-codec/generated/handlers/reflect-config\.json --exclude-config io\.netty\.netty-handler /META-INF/native-image/io\.netty/netty-handler/generated/handlers/reflect-config\.json quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner -jar quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner.jar
Warning: The option '-H:ReflectionConfigurationResources=META-INF/native-image/io.netty/netty-transport/reflection-config.json' is experimental and must be enabled via '-H:+UnlockExperimentalVMOptions' in the future.
Warning: Please re-evaluate whether any experimental option is required, and either remove or unlock it. The build output lists all active experimental options, including where they come from and possible alternatives. If you think an experimental option should be considered as stable, please file an issue.
========================================================================================================================
GraalVM Native Image: Generating 'quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner' (executable)...
========================================================================================================================
For detailed information and explanations on the build output, visit:
https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/BuildOutput.md
------------------------------------------------------------------------------------------------------------------------
[1/8] Initializing...                                                                                    (9.8s @ 0.28GB)
 Java version: 22+22, vendor version: GraalVM CE 22-dev+22.1
 Graal compiler: optimization level: 2, target machine: x86-64-v3
 C compiler: gcc (linux, x86_64, 11.4.0)
 Garbage collector: Serial GC (max heap size: 80% of RAM)
 4 user-specific feature(s):
 - com.oracle.svm.thirdparty.gson.GsonFeature
 - io.quarkus.runner.Feature: Auto-generated class by Quarkus from the existing extensions
 - io.quarkus.runtime.graal.DisableLoggingFeature: Disables INFO logging during the analysis phase
 - org.eclipse.angus.activation.nativeimage.AngusActivationFeature
------------------------------------------------------------------------------------------------------------------------
 4 experimental option(s) unlocked:
 - '-H:+AllowFoldMethods' (origin(s): command line)
 - '-H:BuildOutputJSONFile' (origin(s): command line)
 - '-H:-UseServiceLoaderFeature' (origin(s): command line)
 - '-H:ReflectionConfigurationResources': Use a reflect-config.json in your META-INF/native-image/<groupID>/<artifactID> directory instead. (origin(s): 'META-INF/native-image/io.netty/netty-transport/native-image.properties' in 'file:///home/runner/work/mandrel/mandrel/quarkus/integration-tests/kubernetes-client/target/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-native-image-source-jar/lib/io.netty.netty-transport-4.1.100.Final.jar')
------------------------------------------------------------------------------------------------------------------------
Build resources:
 - 4.44GB of memory (28.5% of 15.61GB system memory, set via '-Xmx5g')
 - 4 thread(s) (100.0% of 4 available processor(s), determined at start)
os.name
[2/8] Performing analysis...  [******]                                                                 (283.1s @ 3.83GB)
   27,338 reachable types   (92.9% of   29,417 total)
   58,060 reachable fields  (79.6% of   72,964 total)
  277,512 reachable methods (82.6% of  335,942 total)
   16,468 types, 34,206 fields, and 163,562 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access
        4 native libraries: dl, pthread, rt, z
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
The Native Image build process ran out of memory.
Please make sure your build system has more memory available.
[INFO]

See:
https://github.com/graalvm/mandrel/actions/runs/6885132117/job/18729490442#step:12:207

It looks like the types reached and methods reached have increased significantly from a run that last worked 2 days ago here:
https://github.com/graalvm/mandrel/actions/runs/6885132117/job/18729490442#step:12:207

The specifics are as follows:

GOOD

    18,450 reachable types   (74.8% of   24,659 total)
   36,922 reachable fields  (71.3% of   51,784 total)
  132,466 reachable methods (66.7% of  198,694 total)
    7,534 types,   779 fields, and 43,505 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access

BAD

   27,338 reachable types   (92.9% of   29,417 total)
   58,060 reachable fields  (79.6% of   72,964 total)
  277,512 reachable methods (82.6% of  335,942 total)
   16,468 types, 34,206 fields, and 163,562 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access

Was there a change recently which could have caused this?

It's also concerning that we now see this (not in the passing test):

 Warning:  [io.quarkus.deployment.steps.NativeImageAllowIncompleteClasspathAggregateStep] The following extensions have required native-image to allow run-time resolution of classes: {quarkus-kubernetes-client}. This is a global requirement which might have unexpected effects on other extensions as well, and is a hint of the library needing some additional refactoring to better support GraalVM native-image. In the case of 3rd party dependencies and/or proprietary code there is not much we can do - please ask for support to your library vendor. If you incur in any problem with other Quarkus extensions, please try reproducing the problem without these extensions first.
Warning:  [io.quarkus.deployment.steps.ReflectiveHierarchyStep] Unable to properly register the hierarchy of the following classes for reflection as they are not in the Jandex index:
	- io.fabric8.openshift.api.model.operator.v1.GenerationStatus (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
	- io.fabric8.openshift.api.model.operator.v1.OperatorCondition (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
Consider adding them to the index either by creating a Jandex index for your dependency via the Maven plugin, an empty META-INF/beans.xml or quarkus.index-dependency properties.

quarkus-bot · 2023-11-16T11:02:14Z

/cc @Karm (mandrel), @galderz (mandrel), @geoand (kubernetes,openshift), @iocanel (kubernetes,openshift), @zakkak (mandrel,native-image)

zakkak · 2023-11-16T11:08:21Z

FTR: the same test builds fine with Mandrel but still has the increased reachable types.

https://github.com/graalvm/mandrel/actions/runs/6885132117/job/18729383262#step:12:201

Was there a change recently which could have caused this?

#36312 was merged 2 days ago, so it could be related (not verified).

geoand · 2023-11-16T11:09:57Z

cc @manusa

manusa · 2023-11-16T11:30:15Z

Sorry, I'm not familiar with the graalvm/mandrel pipelines.

Does this mean that even with the dependency exclusions (and hack extension to remove the link check) GraalVM is still running out of memory?

manusa · 2023-11-16T11:39:09Z

If this is the case, there are a few other unused modules that could be excluded too (these are the larger ones):

1.9M openshift-model-config
1.8M openshift-model-hive
1.5M openshift-model-monitoring
1.1M kubernetes-model-admissionregistration
525K openshift-model-machine

jerboaa · 2023-11-16T16:03:47Z

We could bump the memory limit, but since the static analysis results differ wildly, I'd rather we do some investigation of whether or not this could be reduced. The end result will also influence image size. To back that up with some numbers: I see 208.51MB total image size, while it was 90.42MB last week with essentially the same mandrel version (23.1). So I think this definitely is worth investigating. Using the 23.1 mandrel release should be sufficient.

Compare before #36312 and after.

fedinskiy · 2023-11-20T10:57:16Z

We (Quarkus QE) see this bug on mandrel as well. Reproducer:

git clone git@github.com:quarkus-qe/quarkus-test-suite.git
cd quarkus-test-suite
mvn clean verify -P root-modules -D native -pl funqy/knative-events/ -Dquarkus.platform.version=3.5.2 # This works
mvn clean verify -P root-modules -D native -pl funqy/knative-events/ # uses 999-SNAPSHOT, hangs for 40 minutes, then fails

Logs output:

 Java version: 17.0.9+9, vendor version: Mandrel-23.0.2.1-Final
 Graal compiler: optimization level: 2, target machine: x86-64-v3
 C compiler: gcc (redhat, x86_64, 8.5.0)
 Garbage collector: Serial GC (max heap size: 80% of RAM)

jerboaa · 2023-11-20T11:05:04Z

Thanks. It doesn't seem GraalVM CE/Mandrel related. We see it in CI sometimes passing (producing a 200MB binary!) and sometimes failing (Watchdog timeout or GC overhead limit).

michalvavrik · 2023-11-21T08:44:12Z

Thanks. It doesn't seem GraalVM CE/Mandrel related. We see it in CI sometimes passing (producing a 200MB binary!) and sometimes failing (Watchdog timeout or GC overhead limit).

This is not CI related. I just reproduced it on my very powerful workstation. There is an actual, serious issue.

michalvavrik · 2023-11-21T08:46:35Z

Although the output says "The Native Image build process ran out of memory. (The maximum heap size of the process was set with '-Xmx4g'.)", so a default maximum was set somehow.

michalvavrik · 2023-11-21T08:57:15Z

Sorry, I just tried it with quarkus.native.native-image-xmx=10g and it works. So it is as the previous comments here say: the Kubernetes client takes too much memory now.

zakkak · 2023-11-21T14:40:46Z

I had a better look at #36312 and although I don't have a solution I see the following things that are concerning and are all related to getting the CI happy:

The PR increases the max heap size used in the CI (so the issue @jerboaa actually reported here was already caught during the PR testing).
The PR "hacks" the kubernetes-client integration test to exclude specific dependencies and also allow an incomplete classpath

The above changes apply only to the test, meaning that what we test is not what the users will use. Furthermore, allowing an incomplete class path opens the door to issues related to reachable but not included code going unnoticed.

Removing the make-ci-happy changes, I get the following numbers, which are even worse than the ones @jerboaa reported:

   35,548 reachable types   (94.4% of   37,641 total)
   75,427 reachable fields  (79.0% of   95,511 total)
  343,186 reachable methods (85.5% of  401,567 total)
   24,759 types, 51,561 fields, and 215,013 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access
...
 131.64MB (48.15%) for code area:   305,363 compilation units
 141.43MB (51.72%) for image heap:1,024,956 objects and 156 resources
 364.20kB ( 0.13%) for other data
 273.43MB in total

which is probably closer to what Quarkus users will get, while before #36312 it was

    17,374 reachable types   (74.1% of   23,446 total)
   34,042 reachable fields  (71.1% of   47,886 total)
  125,928 reachable methods (67.0% of  188,051 total)
    7,162 types, 1,163 fields, and 43,454 methods registered for reflection
       61 types,    59 fields, and    55 methods registered for JNI access
...
  43.61MB (48.22%) for code area:    92,240 compilation units
  46.42MB (51.33%) for image heap:  465,045 objects and 59 resources
 410.02kB ( 0.44%) for other data
  90.43MB in total

I agree with @jerboaa that the most important thing here is not the higher resource utilization at build time, but why the same test now requires so much more data in the binary (which is probably related to the resource utilization at build time).
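To put the two build outputs quoted above side by side, here is a minimal sketch that computes the growth factors. The class and method names (`BuildStatsGrowth`, `growth`) are hypothetical and only for illustration; the input figures are copied from the logs in this comment.

```java
import java.util.Locale;

public class BuildStatsGrowth {
    // Ratio of the "after" value to the "before" value (hypothetical helper).
    static double growth(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Figures copied from the build outputs quoted in this comment:
        // first argument is before #36312, second is after (without the CI workarounds).
        System.out.printf(Locale.ROOT, "reachable types:   %.2fx%n", growth(17_374, 35_548));
        System.out.printf(Locale.ROOT, "reachable methods: %.2fx%n", growth(125_928, 343_186));
        System.out.printf(Locale.ROOT, "image size (MB):   %.2fx%n", growth(90.43, 273.43));
    }
}
```

By this measure the reachable types roughly double, the reachable methods grow by about 2.7x, and the total image size by about 3x.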

If this is the case, there are a few other unused modules that could be excluded too (these are the larger ones):

If they are unused only in the test, that's not the right way to go. If they are generally unused we should find a way to exclude them in general not only in the test.

I will try to do some analysis but I don't know what the ETA will be.

manusa · 2023-11-23T06:22:35Z

If they are unused only in the test, that's not the right way to go. If they are generally unused we should find a way to exclude them in general not only in the test.

Let me give a little bit of context to make things clearer.

The io.fabric8:openshift-client module includes transitive dependencies to all of the openshift-model-xxx modules that contain the typed classes for all OpenShift managed objects.

The OpenShiftClient interface and its main implementation OpenShiftClientImpl contain methods such as public MixedOperation<BareMetalHost, BareMetalHostList, Resource<BareMetalHost>> bareMetalHosts() which provide easy access to those resources.

For example, one can easily query bareMetalHosts by doing openshiftClient.bareMetalHosts().list() or even more complex operations such as:

client.bareMetalHosts().withName("the-name")
  .edit(host -> new BareMetalHostBuilder(host).editMetadata().addToAnnotations("foo", "bar").endMetadata().build());

in a single statement.

The problem is that since these methods reference classes from those modules, even if the user knows they aren't going to use them and excludes the model modules, the native image compiling process still complains about the unlinked classes.
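A minimal, self-contained sketch of why the signatures alone are enough to reference the model classes. Everything here is a hypothetical stand-in: `Client`, `BareMetalHost`, and `Hive` are simplified placeholders defined in the snippet, not the real fabric8 types, and listing return types via reflection is only an illustration, not how native-image performs its analysis.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

public class SignatureReferences {
    // Hypothetical stand-ins for model classes that would live in separate model modules.
    static class BareMetalHost {}
    static class Hive {}

    // Hypothetical stand-in for a client interface exposing one accessor per model type.
    interface Client {
        BareMetalHost bareMetalHosts();
        Hive hives();
    }

    // Collects the simple names of all types referenced as return types by the interface.
    static Set<String> referencedTypes(Class<?> api) {
        Set<String> names = new TreeSet<>();
        for (Method m : api.getDeclaredMethods()) {
            names.add(m.getReturnType().getSimpleName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints every model type the interface refers to, whether or not the caller uses it.
        System.out.println(referencedTypes(Client.class));
    }
}
```

Even if an application never calls `hives()`, the `Client` type still names `Hive` in its signature, so the class has to be resolvable wherever `Client` itself is linked.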

We're tracking this at fabric8io/kubernetes-client#5592

So besides what's proposed in #37278, other options would include breaking down the openshift-client into smaller clients.

fedinskiy · 2024-02-14T13:18:27Z

@manusa, in the comments on the fabric8 issue you said [1] that you were planning a Quarkus-specific fix for the bug. Did it materialize? The issue still affects us as of 3.7.2.

[1] fabric8io/kubernetes-client#5592 (comment)

manusa · 2024-02-14T13:40:43Z

I can't remember now exactly what I meant with that comment (I should have elaborated more).

What I can remember is that I added the MiscellaneousSubstitutions and OperatorSubstitutions that allow for the <exclusions> to work in native mode.

With the current state of the client and the Quarkus extensions, I'm not sure there's anything else we can do that preserves the current API. The only options that I can think of are more aggressive.
Anyway, let's discuss this later and see if we can find a suitable solution.

zakkak · 2024-03-27T12:41:11Z

This was more extensively discussed in #38683 and is now fixed by #38886 and fabric8io/kubernetes-client#5759

jerboaa added the kind/bug Something isn't working label Nov 16, 2023

quarkus-bot bot added area/kubernetes area/native-image labels Nov 16, 2023

jerboaa mentioned this issue Nov 16, 2023

[CI] Quarkus main fails with Java 22 mandrel build of graal/master on Linux graalvm/mandrel#580

Closed

jerboaa changed the title ~~[GraalVM 24.0] Kubernetes client native integration test OOMs (GC overhead limit reached) with -Xmx5g~~ Kubernetes client native integration test OOMs (GC overhead limit reached) with -Xmx5g Nov 16, 2023

michalvavrik mentioned this issue Nov 21, 2023

Lower maximum Java heap size in GH native daily to 5g quarkus-qe/quarkus-test-suite#1534

Merged

9 tasks

fedinskiy added a commit to fedinskiy/quarkus-test-suite that referenced this issue Nov 22, 2023

Disable native tests for funqy due to upstream issue

54275e4

quarkusio/quarkus#37142

fedinskiy mentioned this issue Nov 22, 2023

Disable native tests for funqy due to upstream issue quarkus-qe/quarkus-test-suite#1539

Merged

9 tasks

manusa mentioned this issue Nov 23, 2023

Allow for model module exclusions in native mode #37278

Merged

fedinskiy added a commit to fedinskiy/quarkus-test-suite that referenced this issue Nov 23, 2023

Disable native tests for funqy due to upstream issue

5a98b55

quarkusio/quarkus#37142

fedinskiy added a commit to fedinskiy/quarkus-test-suite that referenced this issue Nov 23, 2023

Disable native tests for funqy due to upstream issue

65b95ae

quarkusio/quarkus#37142

zakkak mentioned this issue Feb 8, 2024

Build time performance regression and bigger native binaries when migrating from 3.5 to 3.6 or 3.7 #38683

Closed

fedinskiy mentioned this issue Feb 14, 2024

Enable disabled tests and stop using deploymentconfig quarkus-qe/quarkus-test-suite#1645

Merged

9 tasks

zakkak closed this as completed Mar 27, 2024

fedinskiy mentioned this issue Sep 18, 2024

Native compilation for openshift and knative consumes 4 times more RAM and takes 4 times longer #43360

Closed
