
[Bug]: Assigning opentelemetry_otlp layer causes hang when using tokio::spawn. #2071

Open
drwhut opened this issue Aug 31, 2024 · 7 comments
Assignees
Labels
bug Something isn't working triage:todo Needs to be triaged.

Comments

@drwhut

drwhut commented Aug 31, 2024

What happened?

Hey! I'm having an issue when using an opentelemetry_otlp layer in a tracing_subscriber while spawning another task with tokio::spawn.

I've managed to make a minimal reproduction example. Here's the Cargo.toml that I have currently:

[dependencies]
opentelemetry = "0.24"
opentelemetry-otlp = "0.17"
tokio = { version = "1", features = ["full", "test-util"] }
tracing = "0.1"
tracing-opentelemetry = "0.25"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }

And this is the code that hangs:

use opentelemetry::trace::TracerProvider;
use tracing_subscriber::filter::{EnvFilter, LevelFilter};
use tracing_subscriber::{fmt, prelude::*};
use tracing::info;

#[tracing::instrument]
async fn some_function() {
    info!("some_function()");
    inner_function().await;
    info!("inner function done.");
}

#[tracing::instrument]
async fn inner_function() {
    info!("inner function!");
}

#[tokio::main]
async fn main() {
    let fmt_layer = fmt::layer().with_target(false);

    let filter_layer = EnvFilter::builder()
        .with_default_directive(LevelFilter::INFO.into())
        .from_env_lossy();

    let otlp_exporter = opentelemetry_otlp::new_exporter().tonic();

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(otlp_exporter)
        .install_simple()
        .unwrap()
        .tracer("test_server");

    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);

    tracing_subscriber::registry()
        .with(filter_layer)
        .with(fmt_layer)
        .with(otel_layer)
        .init();

    info!("calling...");
    let handle = tokio::spawn(some_function());
    handle.await.unwrap();
    info!("done!");
}

If I run the code as is, this is the output that I get:

2024-08-29T15:13:10.063000Z  INFO calling...
2024-08-29T15:13:10.063108Z  INFO some_function: some_function()
2024-08-29T15:13:10.063138Z  INFO some_function:inner_function: inner function!

The program seems to hang when exiting inner_function(). This happens regardless of whether the OTLP collector is running on the system, and regardless of whether the Tokio runtime is running in multi-thread or single-thread mode. However, if I don't add the OTLP layer, or if I stop using tokio::spawn() and instead call some_function().await directly, then the program behaves as expected:

2024-08-29T15:15:24.392935Z  INFO calling...
2024-08-29T15:15:24.393030Z  INFO some_function: some_function()
2024-08-29T15:15:24.393059Z  INFO some_function:inner_function: inner function!
2024-08-29T15:15:24.577040Z  INFO some_function: inner function done.
2024-08-29T15:15:24.760723Z  INFO done!
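
For clarity, here is a minimal sketch of the only change between the hanging run and the working run; everything else in the repro above stays the same:

// Hangs with the OTLP layer installed via install_simple():
let handle = tokio::spawn(some_function());
handle.await.unwrap();

// Behaves as expected: await the instrumented function directly on the main task.
some_function().await;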

Any ideas on what's going wrong here? Thank you in advance!

NOTE 1: I've also created an issue in the tracing-opentelemetry repository in the event that it's actually tracing-subscriber causing the issue: tokio-rs/tracing-opentelemetry#162

NOTE 2: I also noticed a similar-looking issue in #1745, but I wasn't sure if it is the same bug, as they do not use a Tokio runtime in their example.

API Version

0.24

SDK Version

N/A

What Exporter(s) are you seeing the problem on?

OTLP

Relevant log output

See above.
@drwhut drwhut added bug Something isn't working triage:todo Needs to be triaged. labels Aug 31, 2024
@nresare

nresare commented Sep 3, 2024

I managed to reproduce this in a debugger. SimpleProcessor.on_end() blocks forever in futures_executor::block_on(), which in turn parks forever in run_executor().

It seems this crate assumes that futures_executor::block_on plays nicely with Tokio, which does not appear to be the case.

It would be good to find a proper solution to this, but in the meantime you can use install_batch() instead of install_simple(), which will use a Tokio-backed executor and seems to sidestep the problem.
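
For the repro above, that means changing only the pipeline setup. A minimal sketch, assuming opentelemetry_sdk 0.24 is added to Cargo.toml with the "rt-tokio" feature:

let otlp_exporter = opentelemetry_otlp::new_exporter().tonic();

// install_batch() exports spans through the batch processor driven by the Tokio
// runtime, instead of calling futures_executor::block_on on every span end.
let tracer = opentelemetry_otlp::new_pipeline()
    .tracing()
    .with_exporter(otlp_exporter)
    .install_batch(opentelemetry_sdk::runtime::Tokio)
    .unwrap()
    .tracer("test_server");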

@EinarasGar

I'm also experiencing the same issue after updating from

tracing-opentelemetry = { version = "0.22.0" }
opentelemetry = { version = "0.21.0", features = ["metrics"] }
opentelemetry_sdk = "0.21.2"
opentelemetry-otlp = "0.14.0"

to

tracing-opentelemetry = { version = "0.25.0" }
opentelemetry = { version = "0.24.0", features = ["metrics"] }
opentelemetry_sdk = "0.24.1"
opentelemetry-otlp = "0.17.0"

It was working fine before; after updating, it started to hang.

Using .install_batch(opentelemetry_sdk::runtime::Tokio) as suggested fixed my issue.

@cijothomas
Member

> It was working fine before; after updating, it started to hang. Using .install_batch(opentelemetry_sdk::runtime::Tokio) as suggested fixed my issue.

Thanks for sharing more details! You are right. This is broken due to #1612, which switched the SimpleProcessor to use block_on instead of the dedicated background thread it used previously. We won't fix this problem for the upcoming release (expected tonight/tomorrow), but we will find a way to avoid these issues in the next release.

@cijothomas cijothomas self-assigned this Sep 4, 2024
@cijothomas
Member

Related #2031
Given that the exporter is async, it forces some runtime (or something like futures_executor), which is also causing issues as reported here. The resolution to #2031 would influence the fix here too.

@lalitb
Member

lalitb commented Sep 4, 2024

I attached the debugger to the otel-rust code to see exactly which line it is stuck at; it is exactly here -

which is called in the flow below:

SimpleProcessor.on_end() -> futures_executor::block_on() -> TonicTracesClient::export() -> self.inner.unary()

So it hangs while processing the gRPC response from the collector. The issue is the same as the one described here -
gRPC call hangs after receiving server response but before returning response to caller - and the reason mentioned is the intermixing of futures::executor::block_on with #[tokio::main] on the main function.
This validates the findings in the comments above.
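
To make that intermixing concrete, here is a minimal, self-contained sketch (not SDK or tonic code; it assumes the futures-executor crate plus tokio with the "full" feature, and it simplifies to a current-thread runtime) that deadlocks for the same reason:

use tokio::sync::oneshot;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, rx) = oneshot::channel::<()>();

    // This task can only run once the single worker thread is free again.
    tokio::spawn(async move {
        let _ = tx.send(());
    });

    // block_on parks the worker thread until rx resolves, but rx can only resolve
    // once the spawned task runs on this same (now parked) thread: deadlock.
    futures_executor::block_on(async {
        let _ = rx.await;
    });

    println!("never reached");
}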

@cijothomas
Member

@lalitb Thanks for digging further! It looks like we need to redesign the SimpleProcessor too, as it cannot be used with #[tokio::main].

@glueball

I don't know if it is related to this, but I've seen that opentelemetry-otlp v0.25 forces a downgrade of Tokio v1.40 to Tokio v1.38.1, because the Cargo.toml apparently uses a "~" (tilde) dependency (as seen here: https://crates.io/crates/opentelemetry-otlp/0.25.0/dependencies).

The previous version (v0.17) used a caret dependency (https://crates.io/crates/opentelemetry-otlp/0.17.0/dependencies).
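
For reference, a sketch of the difference between the two requirement styles (illustrative version numbers, not the exact ones in the published crates):

[dependencies]
# Tilde requirement: "~1.38" means ">= 1.38.0, < 1.39.0", so only patch updates are
# allowed and a workspace already on tokio 1.40 gets pulled back to the 1.38.x line.
tokio = "~1.38"
# Caret requirement (Cargo's default for a bare version): "1.38" means ">= 1.38.0, < 2.0.0",
# which tokio 1.40 already satisfies.
# tokio = "1.38"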
