Kubernetes client native integration test OOMs (GC overhead limit reached) with -Xmx5g #37142

Closed
jerboaa opened this issue Nov 16, 2023 · 16 comments

Comments

@jerboaa
Contributor

jerboaa commented Nov 16, 2023

Describe the bug

Since today in Mandrel CI, the kubernetes-client native integration test OOMs with a GraalVM master build:

[INFO] --- quarkus:999-SNAPSHOT:build (default) @ quarkus-integration-test-kubernetes-client ---
Warning:  [io.quarkus.deployment.steps.NativeImageAllowIncompleteClasspathAggregateStep] The following extensions have required native-image to allow run-time resolution of classes: {quarkus-kubernetes-client}. This is a global requirement which might have unexpected effects on other extensions as well, and is a hint of the library needing some additional refactoring to better support GraalVM native-image. In the case of 3rd party dependencies and/or proprietary code there is not much we can do - please ask for support to your library vendor. If you incur in any problem with other Quarkus extensions, please try reproducing the problem without these extensions first.
Warning:  [io.quarkus.deployment.steps.ReflectiveHierarchyStep] Unable to properly register the hierarchy of the following classes for reflection as they are not in the Jandex index:
	- io.fabric8.openshift.api.model.operator.v1.GenerationStatus (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
	- io.fabric8.openshift.api.model.operator.v1.OperatorCondition (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
Consider adding them to the index either by creating a Jandex index for your dependency via the Maven plugin, an empty META-INF/beans.xml or quarkus.index-dependency properties.
[INFO] [io.quarkus.deployment.pkg.steps.JarResultBuildStep] Building native image source jar: /home/runner/work/mandrel/mandrel/quarkus/integration-tests/kubernetes-client/target/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-native-image-source-jar/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner.jar
[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildStep] Building native image from /home/runner/work/mandrel/mandrel/quarkus/integration-tests/kubernetes-client/target/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-native-image-source-jar/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner.jar
[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildStep] Running Quarkus native-image plugin on GRAALVM 24.0-dev JDK 22+22-jvmci-b01
[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildRunner] /home/runner/work/mandrel/mandrel/graalvm-home/bin/native-image -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Dlogging.initial-configurator.min-level=500 -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=3 -J-Duser.language=en -J-Duser.country=US -J-Dfile.encoding=UTF-8 --features=io.quarkus.runner.Feature,io.quarkus.runtime.graal.DisableLoggingFeature -J--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED -J--add-opens=java.base/java.text=ALL-UNNAMED -J--add-opens=java.base/java.io=ALL-UNNAMED -J--add-opens=java.base/java.lang.invoke=ALL-UNNAMED -J--add-opens=java.base/java.util=ALL-UNNAMED -H:+UnlockExperimentalVMOptions -H:BuildOutputJSONFile=quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner-build-output-stats.json -H:-UnlockExperimentalVMOptions -H:+UnlockExperimentalVMOptions -H:+AllowFoldMethods -H:-UnlockExperimentalVMOptions -J-Djava.awt.headless=true --no-fallback -H:+UnlockExperimentalVMOptions -H:+ReportExceptionStackTraces -H:-UnlockExperimentalVMOptions -J-Xmx5g -H:-AddAllCharsets --enable-url-protocols=http,https -H:NativeLinkerOption=-no-pie --enable-monitoring=heapdump -H:+UnlockExperimentalVMOptions -H:-UseServiceLoaderFeature -H:-UnlockExperimentalVMOptions -J--add-exports=org.graalvm.nativeimage/org.graalvm.nativeimage.impl=ALL-UNNAMED --exclude-config io\.netty\.netty-codec /META-INF/native-image/io\.netty/netty-codec/generated/handlers/reflect-config\.json --exclude-config io\.netty\.netty-handler /META-INF/native-image/io\.netty/netty-handler/generated/handlers/reflect-config\.json quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner -jar quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner.jar
Warning: The option '-H:ReflectionConfigurationResources=META-INF/native-image/io.netty/netty-transport/reflection-config.json' is experimental and must be enabled via '-H:+UnlockExperimentalVMOptions' in the future.
Warning: Please re-evaluate whether any experimental option is required, and either remove or unlock it. The build output lists all active experimental options, including where they come from and possible alternatives. If you think an experimental option should be considered as stable, please file an issue.
========================================================================================================================
GraalVM Native Image: Generating 'quarkus-integration-test-kubernetes-client-999-SNAPSHOT-runner' (executable)...
========================================================================================================================
For detailed information and explanations on the build output, visit:
https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/BuildOutput.md
------------------------------------------------------------------------------------------------------------------------
[1/8] Initializing...                                                                                    (9.8s @ 0.28GB)
 Java version: 22+22, vendor version: GraalVM CE 22-dev+22.1
 Graal compiler: optimization level: 2, target machine: x86-64-v3
 C compiler: gcc (linux, x86_64, 11.4.0)
 Garbage collector: Serial GC (max heap size: 80% of RAM)
 4 user-specific feature(s):
 - com.oracle.svm.thirdparty.gson.GsonFeature
 - io.quarkus.runner.Feature: Auto-generated class by Quarkus from the existing extensions
 - io.quarkus.runtime.graal.DisableLoggingFeature: Disables INFO logging during the analysis phase
 - org.eclipse.angus.activation.nativeimage.AngusActivationFeature
------------------------------------------------------------------------------------------------------------------------
 4 experimental option(s) unlocked:
 - '-H:+AllowFoldMethods' (origin(s): command line)
 - '-H:BuildOutputJSONFile' (origin(s): command line)
 - '-H:-UseServiceLoaderFeature' (origin(s): command line)
 - '-H:ReflectionConfigurationResources': Use a reflect-config.json in your META-INF/native-image/<groupID>/<artifactID> directory instead. (origin(s): 'META-INF/native-image/io.netty/netty-transport/native-image.properties' in 'file:///home/runner/work/mandrel/mandrel/quarkus/integration-tests/kubernetes-client/target/quarkus-integration-test-kubernetes-client-999-SNAPSHOT-native-image-source-jar/lib/io.netty.netty-transport-4.1.100.Final.jar')
------------------------------------------------------------------------------------------------------------------------
Build resources:
 - 4.44GB of memory (28.5% of 15.61GB system memory, set via '-Xmx5g')
 - 4 thread(s) (100.0% of 4 available processor(s), determined at start)
os.name
[2/8] Performing analysis...  [******]                                                                 (283.1s @ 3.83GB)
   27,338 reachable types   (92.9% of   29,417 total)
   58,060 reachable fields  (79.6% of   72,964 total)
  277,512 reachable methods (82.6% of  335,942 total)
   16,468 types, 34,206 fields, and 163,562 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access
        4 native libraries: dl, pthread, rt, z
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
The Native Image build process ran out of memory.
Please make sure your build system has more memory available.
[INFO] 

See:
https://github.com/graalvm/mandrel/actions/runs/6885132117/job/18729490442#step:12:207

It looks like the number of reachable types and reachable methods has increased significantly compared to the last passing run, from 2 days ago:
https://github.com/graalvm/mandrel/actions/runs/6885132117/job/18729490442#step:12:207

The specifics are as follows:

GOOD

    18,450 reachable types   (74.8% of   24,659 total)
   36,922 reachable fields  (71.3% of   51,784 total)
  132,466 reachable methods (66.7% of  198,694 total)
    7,534 types,   779 fields, and 43,505 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access

BAD

   27,338 reachable types   (92.9% of   29,417 total)
   58,060 reachable fields  (79.6% of   72,964 total)
  277,512 reachable methods (82.6% of  335,942 total)
   16,468 types, 34,206 fields, and 163,562 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access

Was there a change recently which could have caused this?
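
For anyone who wants to repeat this comparison on other runs, here is a minimal sketch that pulls the "reachable" counts out of the analysis section of the build output and prints the change per kind. It assumes the line layout shown in the GOOD/BAD excerpts above; the class name `ReachabilityDiff` and the `parse` helper are hypothetical, not part of Quarkus or GraalVM.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReachabilityDiff {
    // Matches lines such as "27,338 reachable types   (92.9% of   29,417 total)".
    // Assumption: the layout is the one seen in the build output quoted in this issue.
    private static final Pattern LINE = Pattern.compile(
        "^\\s*([\\d,]+) reachable (types|fields|methods)\\s+\\(([\\d.]+)% of\\s+([\\d,]+) total\\)\\s*$");

    /** Extracts the reachable count per kind (types, fields, methods) from build output text. */
    static Map<String, Long> parse(String buildOutput) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : buildOutput.split("\\R")) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                // Drop the thousands separators before parsing the number.
                counts.put(m.group(2), Long.parseLong(m.group(1).replace(",", "")));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Lines copied from the GOOD and BAD excerpts above.
        String good = String.join("\n",
            "    18,450 reachable types   (74.8% of   24,659 total)",
            "   36,922 reachable fields  (71.3% of   51,784 total)",
            "  132,466 reachable methods (66.7% of  198,694 total)");
        String bad = String.join("\n",
            "   27,338 reachable types   (92.9% of   29,417 total)",
            "   58,060 reachable fields  (79.6% of   72,964 total)",
            "  277,512 reachable methods (82.6% of  335,942 total)");

        Map<String, Long> before = parse(good);
        Map<String, Long> after = parse(bad);
        for (String kind : before.keySet()) {
            long b = before.get(kind);
            long a = after.get(kind);
            System.out.printf(Locale.ROOT, "%-8s %d -> %d (x%.2f)%n", kind, b, a, (double) a / b);
        }
    }
}
```

Lines that do not match the pattern (for example the "registered for reflection" line) are simply ignored, so the full analysis section can be pasted in as-is.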

It's also concerning that we now see this (not in the passing test):

 Warning:  [io.quarkus.deployment.steps.NativeImageAllowIncompleteClasspathAggregateStep] The following extensions have required native-image to allow run-time resolution of classes: {quarkus-kubernetes-client}. This is a global requirement which might have unexpected effects on other extensions as well, and is a hint of the library needing some additional refactoring to better support GraalVM native-image. In the case of 3rd party dependencies and/or proprietary code there is not much we can do - please ask for support to your library vendor. If you incur in any problem with other Quarkus extensions, please try reproducing the problem without these extensions first.
Warning:  [io.quarkus.deployment.steps.ReflectiveHierarchyStep] Unable to properly register the hierarchy of the following classes for reflection as they are not in the Jandex index:
	- io.fabric8.openshift.api.model.operator.v1.GenerationStatus (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
	- io.fabric8.openshift.api.model.operator.v1.OperatorCondition (source: JacksonProcessor > io.fabric8.kubernetes.api.model.ValidationSchema)
Consider adding them to the index either by creating a Jandex index for your dependency via the Maven plugin, an empty META-INF/beans.xml or quarkus.index-dependency properties.
@jerboaa jerboaa added the kind/bug Something isn't working label Nov 16, 2023

quarkus-bot bot commented Nov 16, 2023

/cc @Karm (mandrel), @galderz (mandrel), @geoand (kubernetes,openshift), @iocanel (kubernetes,openshift), @zakkak (mandrel,native-image)

@zakkak
Contributor

zakkak commented Nov 16, 2023

FTR: the same test builds fine with Mandrel but still has the increased reachable types.

https://github.com/graalvm/mandrel/actions/runs/6885132117/job/18729383262#step:12:201

Was there a change recently which could have caused this?

#36312 was merged 2 days ago, so it could be related (not verified).

@geoand
Contributor

geoand commented Nov 16, 2023

cc @manusa

@manusa
Contributor

manusa commented Nov 16, 2023

Sorry, I'm not familiar with the graalvm/mandrel pipelines.

Does this mean that even with the dependency exclusions (and hack extension to remove the link check) GraalVM is still running out of memory?

@manusa
Contributor

manusa commented Nov 16, 2023

If this is the case, there are a few other unused modules that could be excluded too (these are the larger ones):

1.9M openshift-model-config
1.8M openshift-model-hive
1.5M openshift-model-monitoring
1.1M kubernetes-model-admissionregistration
525K openshift-model-machine

@jerboaa
Contributor Author

jerboaa commented Nov 16, 2023

We could bump the memory limit, but since the static analysis results differ wildly, I'd rather we do some investigation of whether or not this could be reduced. The end result will also influence image size. To back that up with some numbers: I see 208.51MB total image size, while it was 90.42MB last week with essentially the same mandrel version (23.1). So I think this definitely is worth investigating. Using the 23.1 mandrel release should be sufficient.

Compare before #36312 and after.

@jerboaa jerboaa changed the title [GraalVM 24.0] Kubernetes client native integration test OOMs (GC overhead limit reached) with -Xmx5g Kubernetes client native integration test OOMs (GC overhead limit reached) with -Xmx5g Nov 16, 2023
@fedinskiy
Contributor

We (Quarkus QE) see this bug on Mandrel as well. Reproducer:

git clone git@github.com:quarkus-qe/quarkus-test-suite.git
cd quarkus-test-suite
mvn clean verify -P root-modules -D native -pl funqy/knative-events/ -Dquarkus.platform.version=3.5.2
# This works
mvn clean verify -P root-modules -D native -pl funqy/knative-events/ # uses 999-SNAPSHOT, hangs for 40 minutes, then fails

Logs output:

 Java version: 17.0.9+9, vendor version: Mandrel-23.0.2.1-Final
 Graal compiler: optimization level: 2, target machine: x86-64-v3
 C compiler: gcc (redhat, x86_64, 8.5.0)
 Garbage collector: Serial GC (max heap size: 80% of RAM)

@jerboaa
Contributor Author

jerboaa commented Nov 20, 2023

Thanks. It doesn't seem GraalVM CE/Mandrel related. We see it in CI sometimes passing (producing a 200MB binary!) and sometimes failing (Watchdog timeout or GC overhead limit).

@michalvavrik
Member

michalvavrik commented Nov 21, 2023

Thanks. It doesn't seem GraalVM CE/Mandrel related. We see it in CI sometimes passing (producing a 200MB binary!) and sometimes failing (Watchdog timeout or GC overhead limit).

This is not CI related. I just reproduced it on my very powerful workstation. There is an actual, serious issue.

@michalvavrik
Member

Although the build reported "The Native Image build process ran out of memory. (The maximum heap size of the process was set with '-Xmx4g'.)", so a default maximum was set somehow.

@michalvavrik
Member

Sorry, I just tried it with quarkus.native.native-image-xmx=10g and it works. So it's as the previous comments here say: the Kubernetes client takes too much memory now.

@zakkak
Contributor

zakkak commented Nov 21, 2023

I had a better look at #36312 and although I don't have a solution I see the following things that are concerning and are all related to getting the CI happy:

  1. The PR increases the max heap size used in the CI (so the issue @jerboaa actually reported here was already caught during the PR testing).
  2. The PR "hacks" the kubernetes-client integration test to exclude specific dependencies and also allow an incomplete classpath

The above changes apply only to the test, meaning that what we test is not what users will use. Furthermore, allowing an incomplete classpath opens the door to issues with reachable but not included code going unnoticed.

Removing the make-ci-happy changes, I get the following numbers, which are even worse than the ones @jerboaa reported:

   35,548 reachable types   (94.4% of   37,641 total)
   75,427 reachable fields  (79.0% of   95,511 total)
  343,186 reachable methods (85.5% of  401,567 total)
   24,759 types, 51,561 fields, and 215,013 methods registered for reflection
       61 types,    61 fields, and    55 methods registered for JNI access
...
 131.64MB (48.15%) for code area:   305,363 compilation units
 141.43MB (51.72%) for image heap:1,024,956 objects and 156 resources
 364.20kB ( 0.13%) for other data
 273.43MB in total

which is probably closer to what Quarkus users will get, while before #36312 it was

    17,374 reachable types   (74.1% of   23,446 total)
   34,042 reachable fields  (71.1% of   47,886 total)
  125,928 reachable methods (67.0% of  188,051 total)
    7,162 types, 1,163 fields, and 43,454 methods registered for reflection
       61 types,    59 fields, and    55 methods registered for JNI access
...
  43.61MB (48.22%) for code area:    92,240 compilation units
  46.42MB (51.33%) for image heap:  465,045 objects and 59 resources
 410.02kB ( 0.44%) for other data
  90.43MB in total

I agree with @jerboaa that the most important thing here is not the higher resource utilization at build time, but why the same test now requires so much more data in the binary (which is probably related to the resource utilization at build time).
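
To make the size of the regression concrete, here is a minimal sketch that computes growth factors from the two reports quoted above. The figures are copied from those reports; the class name `ImageGrowth` and the `growthFactor` helper are hypothetical and only there for illustration.

```java
import java.util.Locale;

public class ImageGrowth {
    /** Returns after / before, i.e. how many times larger the "after" figure is. */
    static double growthFactor(double before, double after) {
        if (before <= 0) {
            throw new IllegalArgumentException("before must be positive");
        }
        return after / before;
    }

    public static void main(String[] args) {
        // Figures copied from the two build reports quoted above
        // (before #36312 vs. after it, with the test-only workarounds removed).
        String[] labels = {
            "code area (MB)",
            "image heap (MB)",
            "total image size (MB)",
            "reachable methods",
            "methods registered for reflection"
        };
        double[] before = {43.61, 46.42, 90.43, 125_928, 43_454};
        double[] after = {131.64, 141.43, 273.43, 343_186, 215_013};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf(Locale.ROOT, "%-34s x%.2f%n",
                labels[i], growthFactor(before[i], after[i]));
        }
    }
}
```

By this measure the total image size roughly triples, and the number of methods registered for reflection grows by close to a factor of five, which is consistent with the reflection registrations being a large part of the increase.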

If this is the case, there are a few other unused modules that could be excluded too (these are the larger ones):

If they are unused only in the test, that's not the right way to go. If they are generally unused we should find a way to exclude them in general not only in the test.

I will try to do some analysis but I don't know what the ETA will be.

@manusa
Contributor

manusa commented Nov 23, 2023

If they are unused only in the test, that's not the right way to go. If they are generally unused we should find a way to exclude them in general not only in the test.

Let me give a little bit of context to make things clearer.

The io.fabric8:openshift-client module includes transitive dependencies to all of the openshift-model-xxx modules that contain the typed classes for all OpenShift managed objects.

The OpenShiftClient interface and its main implementation OpenShiftClientImpl contain methods such as public MixedOperation<BareMetalHost, BareMetalHostList, Resource<BareMetalHost>> bareMetalHosts() which provide easy access to those resources.

For example, one can easily query bareMetalHosts by doing openshiftClient.bareMetalHosts().list() or even more complex operations such as:

client.bareMetalHosts().withName("the-name")
  .edit(host -> new BareMetalHostBuilder(host).editMetadata().addToAnnotations("foo", "bar").endMetadata().build());

in a single statement.

The problem is that since these methods reference classes from those modules, even if the user knows they aren't going to use them and excludes the model modules, the native image compiling process still complains about the unlinked classes.

We're tracking this at fabric8io/kubernetes-client#5592

So besides what's proposed in #37278, other options would include breaking down the openshift-client into smaller clients.

@fedinskiy
Contributor

@manusa in the comments on the fabric8 issue you said [1] that you were planning a Quarkus-specific fix for the bug. Did it materialize? The issue still affects us as of 3.7.2.

[1] fabric8io/kubernetes-client#5592 (comment)

@manusa
Contributor

manusa commented Feb 14, 2024

I can't remember now exactly what I meant by that comment (I should have elaborated more).

What I can remember is that I added the MiscellaneousSubstitutions and OperatorSubstitutions that allow for the <exclusions> to work in native mode.

With the current state of the client and the Quarkus extensions, I'm not sure there's something else that preserves the current API. The only options that I can think of are more aggressive.
Anyway, let's discuss this later and see if we can find a suitable solution.

@zakkak
Contributor

zakkak commented Mar 27, 2024

This was more extensively discussed in #38683 and is now fixed by #38886 and fabric8io/kubernetes-client#5759

@zakkak zakkak closed this as completed Mar 27, 2024