Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault during thread execution with dynamic security #9330

Closed
pepperkick opened this issue Aug 29, 2023 · 11 comments
Closed

Segmentation fault during thread execution with dynamic security #9330

pepperkick opened this issue Aug 29, 2023 · 11 comments
Labels
bug Something isn't working needs author feedback Waiting for additional feedback from the author needs repro stale

Comments

@pepperkick
Copy link

Describe the bug

Hi Community,

I work in a team whose goal is to enable observability for all the company products which are mostly written in java.

We have an application which runs in Kuberentes as a deployment. The application is deployed using Tomcat server. We injected OpenTelemetry agent via annotation to the application deployment manifest. The agent is successfully injected with the application and we are able to see the basic traces and logs.

During a task it downloads certain jars during runtime and loads them. One of those downloaded jar (proprietary, lets call it dep.jar) has functionality to load security dynamically and then execute a thread pool. The security feature in dep.jar can be enabled or disabled via env variable.

Now, These are the observations

  • If otel agent is injected and dep.jar has security feature is disabled then everything works fine.

  • If otel agent is injected and dep.jar has security feature is enabled then the application crashes with the following error

/bootstrap.sh: line 20:    12 Segmentation fault      sh /opt/webserver/bin/launch.sh

No stack trace or any error log printed prior to this (except few logs from dep.jar which helped pinpoint to security feature)

  • If otel and appd agents both are injected and dep.jar has security feature is enabled then the application prints the following error but does not crash
java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "setContextClassLoader")

We want to understand why injecting appd agent along with otel agent fixes the Segmentation fault issue. We also suppressed certain libraries (listed below) which fixes the AccessControlException issue but only when appd agent is injected. If appd agent is not injected and only otel agent then it fails with Segmentation fault issue regardless of suppression.

Following libraries were suppressed

  • hystrix
  • logback-appender
  • spring-scheduling

Steps to reproduce

Currently not available since proprietary application is used, will try to figure out a easier way to reproduce this

Expected behavior

Expected behavior is that the application will work when only otel agent is injected and libraries are suppressed without appd agent injected.

Actual behavior

Actual behavior is that application crashes with Segmentation fault when otel agent is injected and without appd agent injected.

Javaagent or library instrumentation version

1.23.0

Environment

JDK:
Picked up JAVA_TOOL_OPTIONS: -javaagent:/otel-auto-instrumentation/javaagent.jar
[otel.javaagent 2023-08-22 13:49:36:433 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 1.29.0
22-Aug-2023 13:49:44.273 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version name: Apache Tomcat/8.5.81
22-Aug-2023 13:49:44.282 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jun 8 2022 21:30:15 UTC
22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version number: 8.5.81.0
22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version: 4.18.0-425.10.1.el8_7.x86_64
22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture: amd64
22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home: /usr/lib/jvm/java-1.8.0-amazon-corretto/jre
22-Aug-2023 13:49:44.283 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version: 1.8.0_382-b05
22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Vendor: Amazon.com Inc.
22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE: /opt/apache-tomcat-8.5.81
22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_HOME: /opt/apache-tomcat-8.5.81
22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -javaagent:/otel-auto-instrumentation/javaagent.jar
22-Aug-2023 13:49:44.284 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.util.logging.config.file=/opt/webserver/conf/logging.properties
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:+UseStringDeduplication
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:MaxRAMPercentage=70.0
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:MaxMetaspaceSize=4096m
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:+UseParallelGC
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:ParallelGCThreads=1
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:MaxGCPauseMillis=400
22-Aug-2023 13:49:44.285 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Xms1g
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:ReservedCodeCacheSize=512m
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -XX:+UseCodeCacheFlushing
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djdk.tls.ephemeralDHKeySize=2048
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.protocol.handler.pkgs=org.apache.catalina.webresources
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dorg.apache.catalina.security.SecurityListener.UMASK=0027
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.security.egd=file:/dev/./urandom
22-Aug-2023 13:49:44.286 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dignore.endorsed.dirs=
22-Aug-2023 13:49:44.287 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dcatalina.base=/opt/webserver
22-Aug-2023 13:49:44.287 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Dcatalina.home=/opt/webserver
22-Aug-2023 13:49:44.287 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Command line argument: -Djava.io.tmpdir=/opt/webserver/temp

Java Version:
openjdk version "1.8.0_382"
OpenJDK Runtime Environment Corretto-8.382.05.1 (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM Corretto-8.382.05.1 (build 25.382-b05, mixed mode)

OS:
RHEL 8
amazoncorretto:8

Additional context

No response

@pepperkick pepperkick added bug Something isn't working needs triage New issue that requires triage labels Aug 29, 2023
@laurit
Copy link
Contributor

laurit commented Aug 29, 2023

Usually when openjdk segmentation faults it generates a hs_err_pid<pid>.log file with details of the crash in the current directory of the path set with -XX:ErrorFile. See if you can find this crash log. For running application with security manager see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/advanced-configuration-options.md#running-application-with-security-manager it may help agains the AccessControlExceptions.

@laurit laurit added the needs author feedback Waiting for additional feedback from the author label Aug 29, 2023
@mateuszrzeszutek mateuszrzeszutek removed the needs triage New issue that requires triage label Aug 29, 2023
@pepperkick
Copy link
Author

pepperkick commented Aug 29, 2023

Usually when openjdk segmentation faults it generates a hs_err_pid<pid>.log file with details of the crash in the current directory of the path set with -XX:ErrorFile. See if you can find this crash log. For running application with security manager see https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/advanced-configuration-options.md#running-application-with-security-manager it may help agains the AccessControlExceptions.

Appreciate the quick reply. I tried setting the variable OTEL_JAVAAGENT_EXPERIMENTAL_SECURITY_MANAGER_SUPPORT_ENABLED but still faced the Segmentation Fault error. Maybe this is missing in v1.23 so will check later with newer version.

I am not sure but I do not see any hs_err log files. I tried setting -XX:ErrorFile=/tmp/error.log option and made sure it is propagated in pod logs but no error.log file is generated so not sure if the error is getting captured or not.

Will try out more things meanwhile and update if I find more info.
Any other input is also appreciated.

@github-actions github-actions bot removed the needs author feedback Waiting for additional feedback from the author label Aug 29, 2023
@laurit
Copy link
Contributor

laurit commented Aug 29, 2023

You are correct, that flag is missing in 1.23.0.
Based on the current information there isn't anything we can do. Usually when jvm crashes it is a bug in the jvm, you may wish to try a different jvm version (e.g. jdk 11) and see if it works better. In case of a jvm crash common advice is to disable the JIT compiler with -Xint and see if this changes the situation. You could also enable core dumps and see if that reveals clues about the nature of the crash. You could also add more debug logging to your application and try to determine whether it crashes in some specific place, perhaps it is possible to create a small program that reproduces the issue that way.

@pepperkick
Copy link
Author

Thanks for the additional suggestions.

Trying different JDK version unfortunately is not an option currently since the application was created with java 8 and was never upgraded or tested with new java version.

I will try disabling JIT and see what happens. Also as per suggestion I will try to build a simple application which reproduces the issue. I have started going through the code base for this.

I did try dumping core but could not find anything meaningful pointing to the JVM crash, that said I am not really an expert on Java related debugging stuff.

@breedx-splk
Copy link
Contributor

@pepperkick Do you know if the proprietary dep.jar is doing anything with native code, like JNI? Have you also opened an issue with the team that creates dep.jar to see if it is something they can look into?

@breedx-splk breedx-splk added the needs author feedback Waiting for additional feedback from the author label Aug 29, 2023
@pepperkick
Copy link
Author

@pepperkick Do you know if the proprietary dep.jar is doing anything with native code, like JNI? Have you also opened an issue with the team that creates dep.jar to see if it is something they can look into?

As far as I am aware it does not. Will check with the team when I can. All are busy with release cycle so unable to get much info.

Also I want to understand why adding the appd agent fixes the Segmentation Fault? Is appd agent known to fix some issues similar to this?

@github-actions github-actions bot removed the needs author feedback Waiting for additional feedback from the author label Sep 2, 2023
@trask
Copy link
Member

trask commented Sep 4, 2023

Also I want to understand why adding the appd agent fixes the Segmentation Fault? Is appd agent known to fix some issues similar to this?

you'd need to ask the appd folks, but it wouldn't be surprising to me if some unexpected interaction with appd avoids the JVM crash (see heisenbug 😄)

@trask trask added the needs author feedback Waiting for additional feedback from the author label Sep 4, 2023
@pepperkick
Copy link
Author

Also I want to understand why adding the appd agent fixes the Segmentation Fault? Is appd agent known to fix some issues similar to this?

you'd need to ask the appd folks, but it wouldn't be surprising to me if some unexpected interaction with appd avoids the JVM crash (see heisenbug 😄)

Yeah, I agree :P
Wanted to confirm if something is already known around this. Seems like it is an edge case.
I have not managed to reproduce the issue separately yet. It will take some time I think.

@github-actions github-actions bot removed the needs author feedback Waiting for additional feedback from the author label Sep 4, 2023
@laurit
Copy link
Contributor

laurit commented Sep 4, 2023

You could try with a fastdebug build from https://downloads.corretto.aws/#/downloads?build=nightly&version=8 Fastdebug builds contain extra assertions that are removed from release builds, perhaps this will provide a clue about why the crash happens.

@trask
Copy link
Member

trask commented Dec 24, 2024

I think we would need a minimal application that reproduces this issue in order to investigate further.

@laurit laurit added the needs author feedback Waiting for additional feedback from the author label Dec 25, 2024
Copy link
Contributor

github-actions bot commented Jan 7, 2025

This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed automatically if there is no response from the author within 7 additional days from this comment.

@github-actions github-actions bot added the stale label Jan 7, 2025
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs author feedback Waiting for additional feedback from the author needs repro stale
Projects
None yet
Development

No branches or pull requests

5 participants