Commit 1d57bc8

Make sending traces to Tempo less confusing (#4771)
Right now we have a confusing mix of features and naming around telemetry export. The code references both `otel` and `telemetry_only`, and it's unclear how to opt in or out of sending trace information to Tempo/OpenTelemetry backends. Additionally, there are a few bugs preventing OpenTelemetry from working correctly in production deployments.

Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to opt out of sending specific spans to Tempo/OpenTelemetry.

By default, all spans are sent to OpenTelemetry/Tempo backends (when the `opentelemetry` feature is enabled), regardless of log level. To exclude a span from OpenTelemetry export while keeping it in local traces (Chrome trace, logs), add `opentelemetry.skip` to the span fields:

```rust
// This span will NOT be sent to Tempo
#[tracing::instrument(fields(opentelemetry.skip = true))]
fn internal_helper() { }

// This span WILL be sent to Tempo (default behavior)
#[tracing::instrument]
fn public_api() { }
```

**Key Changes:**

**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to `--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more accurate)
- Standardized all environment variables to use `LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter

**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty string as the OTLP endpoint would cause the exporter to fail with "Failed to parse endpoint". The code now checks that the endpoint is non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being initialized before the Tokio runtime was created, causing a panic: "there is no reactor running, must be called from the context of a Tokio 1.x runtime". Tracing initialization now happens after runtime creation.

**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from OpenTelemetry export but still appear in Chrome traces
- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo without crashes

- These changes should be backported to the latest `testnet` branch, then
    - be released in a validator hotfix.

Backporting would be good so that, if we want to add more instrumentation on the testnet, things are less confusing :)
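The empty-endpoint fix described above can be modelled without any OpenTelemetry dependency. The following is a minimal sketch, not the commit's code: `resolve_endpoint` is a hypothetical helper, and the environment value is passed in as an argument (rather than read with `std::env::var`) purely so the precedence rule is easy to check.

```rust
/// Hypothetical helper (not part of the commit): picks the OTLP endpoint to use.
///
/// `cli` models the value of `--otlp-exporter-endpoint`; `env` models the value
/// of `LINERA_OTLP_EXPORTER_ENDPOINT`. An empty string from either source is
/// treated the same as "not provided".
fn resolve_endpoint(cli: Option<&str>, env: Option<&str>) -> Option<String> {
    cli.filter(|ep| !ep.is_empty())
        // Fall back to the environment only when the CLI value is unusable.
        .or(env.filter(|ep| !ep.is_empty()))
        .map(str::to_string)
}
```

Under these assumptions, `resolve_endpoint(Some(""), Some("http://localhost:4317"))` yields the environment value, and `resolve_endpoint(Some(""), None)` yields `None`, which is the case where the caller would fall back to plain tracing instead of building an exporter.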
1 parent 5660150 commit 1d57bc8

21 files changed: +412 -330 lines changed

.github/workflows/rust.yml

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ jobs:
       - uses: actions-rust-lang/setup-rust-toolchain@v1
       - name: Run metrics tests
         run: |
-          cargo test --locked -p linera-base --features metrics
+          cargo test --locked -p linera-base --features metrics,opentelemetry
 
   wasm-application-test:
     needs: changed-files

CLI.md

Lines changed: 2 additions & 2 deletions
@@ -147,8 +147,8 @@ Client implementation and command-line tool for the Linera blockchain
 
   Default value: `10`
 * `--chrome-trace-exporter` — Enable OpenTelemetry Chrome JSON exporter for trace data analysis
-* `--otel-trace-file <OTEL_TRACE_FILE>` — Output file path for Chrome trace JSON format. Can be visualized in chrome://tracing or Perfetto UI
-* `--otel-exporter-otlp-endpoint <OTEL_EXPORTER_OTLP_ENDPOINT>` — OpenTelemetry OTLP exporter endpoint (requires tempo feature)
+* `--chrome-trace-file <CHROME_TRACE_FILE>` — Output file path for Chrome trace JSON format. Can be visualized in chrome://tracing or Perfetto UI
+* `--otlp-exporter-endpoint <OTLP_EXPORTER_ENDPOINT>` — OpenTelemetry OTLP exporter endpoint (requires opentelemetry feature)
 * `--wait-for-outgoing-messages` — Whether to wait until a quorum of validators has confirmed that all sent cross-chain messages have been delivered
 * `--long-lived-services` — (EXPERIMENTAL) Whether application services can persist in some cases between queries
 * `--blanket-message-policy <BLANKET_MESSAGE_POLICY>` — The policy for handling incoming messages

docker/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ ARG binaries=
 ARG copy=${binaries:+_copy}
 ARG build_flag=--release
 ARG build_folder=release
-ARG build_features=scylladb,metrics,memory-profiling,tempo
+ARG build_features=scylladb,metrics,memory-profiling,opentelemetry
 ARG rustflags="-C force-frame-pointers=yes"
 
 FROM rust:1.74-slim-bookworm AS builder

kubernetes/linera-validator/templates/proxy.yaml

Lines changed: 4 additions & 4 deletions
@@ -61,8 +61,8 @@ spec:
               value: {{ .Values.logLevel }}
             - name: RUST_BACKTRACE
               value: "1"
-            - name: OTEL_EXPORTER_OTLP_ENDPOINT
-              value: {{ .Values.otelExporterEndpoint }}
+            - name: LINERA_OTLP_EXPORTER_ENDPOINT
+              value: {{ .Values.otlpExporterEndpoint }}
       containers:
         - name: linera-proxy
           imagePullPolicy: {{ .Values.lineraImagePullPolicy }}
@@ -76,8 +76,8 @@ spec:
           env:
             - name: RUST_LOG
               value: {{ .Values.logLevel }}
-            - name: OTEL_EXPORTER_OTLP_ENDPOINT
-              value: {{ .Values.otelExporterEndpoint }}
+            - name: LINERA_OTLP_EXPORTER_ENDPOINT
+              value: {{ .Values.otlpExporterEndpoint }}
           volumeMounts:
             - name: config
               mountPath: "/config"

kubernetes/linera-validator/templates/shards.yaml

Lines changed: 4 additions & 4 deletions
@@ -41,8 +41,8 @@ spec:
               value: {{ .Values.logLevel }}
             - name: RUST_BACKTRACE
               value: "1"
-            - name: OTEL_EXPORTER_OTLP_ENDPOINT
-              value: {{ .Values.otelExporterEndpoint }}
+            - name: LINERA_OTLP_EXPORTER_ENDPOINT
+              value: {{ .Values.otlpExporterEndpoint }}
           volumeMounts:
             - name: config
               mountPath: "/config"
@@ -59,8 +59,8 @@ spec:
           env:
             - name: RUST_LOG
               value: {{ .Values.logLevel }}
-            - name: OTEL_EXPORTER_OTLP_ENDPOINT
-              value: {{ .Values.otelExporterEndpoint }}
+            - name: LINERA_OTLP_EXPORTER_ENDPOINT
+              value: {{ .Values.otlpExporterEndpoint }}
             {{- if .Values.serverTokioThreads }}
             - name: LINERA_SERVER_TOKIO_THREADS
               value: "{{ .Values.serverTokioThreads }}"

kubernetes/linera-validator/values-local.yaml.gotmpl

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 lineraImage: {{ env "LINERA_HELMFILE_LINERA_IMAGE" | default "linera:latest" }}
 lineraImagePullPolicy: Never
 logLevel: "debug"
-otelExporterEndpoint: {{ env "LINERA_HELMFILE_SET_OTEL_ENDPOINT" | default "http://tempo.tempo.svc.cluster.local:4317" }}
+otlpExporterEndpoint: {{ env "LINERA_HELMFILE_SET_OTLP_EXPORTER_ENDPOINT" | default "" }}
 proxyPort: 19100
 metricsPort: 21100
 numShards: {{ env "LINERA_HELMFILE_SET_NUM_SHARDS" | default 10 }}

linera-base/Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ workspace = true
 metrics = ["prometheus"]
 reqwest = ["dep:reqwest"]
 revm = []
-tempo = ["opentelemetry-otlp"]
+opentelemetry = ["opentelemetry-otlp"]
 test = ["test-strategy", "proptest"]
 web = [
     "getrandom/js",
@@ -83,7 +83,7 @@ tracing-web = { optional = true, workspace = true }
 chrono.workspace = true
 opentelemetry.workspace = true
 opentelemetry-otlp = { workspace = true, optional = true }
-opentelemetry_sdk.workspace = true
+opentelemetry_sdk = { workspace = true, features = ["testing"] }
 tracing-chrome.workspace = true
 tracing-opentelemetry.workspace = true
 rand = { workspace = true, features = ["getrandom", "std", "std_rng"] }

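Renaming a Cargo feature matters because `#[cfg(feature = "...")]` gates are matched by name: a gate that still mentions the old name is simply compiled out. The sketch below illustrates the paired-gate pattern with a hypothetical function `tracing_backend` (not part of the commit); it only assumes that the crate declares a feature called `opentelemetry`, as this hunk does.

```rust
/// Hypothetical stand-in for a real initializer: this version is compiled
/// only when the crate is built with `--features opentelemetry`.
#[cfg(feature = "opentelemetry")]
fn tracing_backend() -> &'static str {
    "otlp"
}

/// Fallback compiled when the feature is off, so callers can use the same
/// function name in both configurations.
#[cfg(not(feature = "opentelemetry"))]
fn tracing_backend() -> &'static str {
    "default"
}
```

Exactly one of the two definitions exists in any given build, so a call to `tracing_backend()` always resolves without the caller needing its own `cfg` attribute.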
linera-base/src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ pub mod time;
 #[cfg_attr(web, path = "tracing_web.rs")]
 pub mod tracing;
 #[cfg(not(target_arch = "wasm32"))]
-pub mod tracing_otel;
+pub mod tracing_opentelemetry;
 #[cfg(test)]
 mod unit_tests;


linera-base/src/tracing.rs

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ use tracing_subscriber::{
 };
 
 #[cfg(not(target_arch = "wasm32"))]
-pub use crate::tracing_otel::{
+pub use crate::tracing_opentelemetry::{
     init_with_chrome_trace_exporter, init_with_opentelemetry, ChromeTraceGuard,
 };

linera-base/src/tracing_opentelemetry.rs

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
+// Copyright (c) Zefchain Labs, Inc.
+// SPDX-License-Identifier: Apache-2.0
+
+//! OpenTelemetry integration for tracing with OTLP export and Chrome trace export.
+
+use tracing_chrome::ChromeLayerBuilder;
+use tracing_subscriber::{layer::SubscriberExt as _, util::SubscriberInitExt as _};
+#[cfg(feature = "opentelemetry")]
+use {
+    opentelemetry::{global, trace::TracerProvider},
+    opentelemetry_otlp::{SpanExporter, WithExportConfig},
+    opentelemetry_sdk::{
+        trace::{InMemorySpanExporter, SdkTracerProvider},
+        Resource,
+    },
+    tracing_opentelemetry::OpenTelemetryLayer,
+    tracing_subscriber::{
+        filter::{filter_fn, FilterFn},
+        layer::Layer,
+    },
+};
+
+/// Creates a filter that excludes spans with the `opentelemetry.skip` field.
+///
+/// Any span that declares an `opentelemetry.skip` field will be excluded from export,
+/// regardless of the field's value. This is a limitation of the tracing metadata API.
+///
+/// Usage examples:
+/// ```ignore
+/// // Always skip this span
+/// #[tracing::instrument(fields(opentelemetry.skip = true))]
+/// fn internal_helper() { }
+///
+/// // Conditionally skip based on a parameter
+/// #[tracing::instrument(fields(opentelemetry.skip = should_skip))]
+/// fn my_function(should_skip: bool) {
+///     // Will be skipped if should_skip is true when called
+///     // Note: The field must be declared in the span, so the span is
+///     // created with knowledge that it might be skipped
+/// }
+/// ```
+#[cfg(feature = "opentelemetry")]
+fn opentelemetry_skip_filter() -> FilterFn<impl Fn(&tracing::Metadata<'_>) -> bool> {
+    filter_fn(|metadata| {
+        if !metadata.is_span() {
+            return false;
+        }
+        metadata.fields().field("opentelemetry.skip").is_none()
+    })
+}
+
+/// Initializes tracing with a custom OpenTelemetry tracer provider.
+///
+/// This is an internal function used by both production and test code.
+#[cfg(feature = "opentelemetry")]
+fn init_with_tracer_provider(log_name: &str, tracer_provider: SdkTracerProvider) {
+    global::set_tracer_provider(tracer_provider.clone());
+    let tracer = tracer_provider.tracer("linera");
+
+    let opentelemetry_layer =
+        OpenTelemetryLayer::new(tracer).with_filter(opentelemetry_skip_filter());
+
+    let config = crate::tracing::get_env_config(log_name);
+    let maybe_log_file_layer = config.maybe_log_file_layer();
+    let stderr_layer = config.stderr_layer();
+
+    tracing_subscriber::registry()
+        .with(opentelemetry_layer)
+        .with(config.env_filter)
+        .with(maybe_log_file_layer)
+        .with(stderr_layer)
+        .init();
+}
+
+/// Builds an OpenTelemetry layer with the opentelemetry.skip filter.
+///
+/// This is used for testing to avoid setting the global subscriber.
+/// Returns the layer, exporter, and tracer provider (which must be kept alive and shutdown).
+#[cfg(all(with_testing, feature = "opentelemetry"))]
+pub fn build_opentelemetry_layer_with_test_exporter(
+    log_name: &str,
+) -> (
+    impl tracing_subscriber::Layer<tracing_subscriber::Registry>,
+    InMemorySpanExporter,
+    SdkTracerProvider,
+) {
+    let exporter = InMemorySpanExporter::default();
+    let exporter_clone = exporter.clone();
+
+    let resource = Resource::builder()
+        .with_service_name(log_name.to_string())
+        .build();
+
+    let tracer_provider = SdkTracerProvider::builder()
+        .with_resource(resource)
+        .with_simple_exporter(exporter)
+        .with_sampler(opentelemetry_sdk::trace::Sampler::AlwaysOn)
+        .build();
+
+    global::set_tracer_provider(tracer_provider.clone());
+    let tracer = tracer_provider.tracer("linera");
+    let opentelemetry_layer =
+        OpenTelemetryLayer::new(tracer).with_filter(opentelemetry_skip_filter());
+
+    (opentelemetry_layer, exporter_clone, tracer_provider)
+}
+
+/// Initializes tracing with OpenTelemetry OTLP exporter.
+///
+/// Exports traces using the OTLP protocol to any OpenTelemetry-compatible backend.
+/// Requires the `opentelemetry` feature.
+/// Only enables OpenTelemetry if LINERA_OTLP_EXPORTER_ENDPOINT env var is set.
+/// This prevents DNS errors in environments where OpenTelemetry is not deployed.
+#[cfg(feature = "opentelemetry")]
+pub fn init_with_opentelemetry(log_name: &str, otlp_endpoint: Option<&str>) {
+    // Check if OpenTelemetry endpoint is configured via parameter or env var
+    let endpoint = match otlp_endpoint {
+        Some(ep) if !ep.is_empty() => ep.to_string(),
+        _ => match std::env::var("LINERA_OTLP_EXPORTER_ENDPOINT") {
+            Ok(ep) if !ep.is_empty() => ep,
+            _ => {
+                eprintln!(
+                    "LINERA_OTLP_EXPORTER_ENDPOINT not set and no endpoint provided. \
+                     Falling back to standard tracing without OpenTelemetry support."
+                );
+                crate::tracing::init(log_name);
+                return;
+            }
+        },
+    };
+
+    let resource = Resource::builder()
+        .with_service_name(log_name.to_string())
+        .build();
+
+    let exporter = SpanExporter::builder()
+        .with_tonic()
+        .with_endpoint(endpoint)
+        .build()
+        .expect("Failed to create OTLP exporter");
+
+    let tracer_provider = SdkTracerProvider::builder()
+        .with_resource(resource)
+        .with_batch_exporter(exporter)
+        .with_sampler(opentelemetry_sdk::trace::Sampler::AlwaysOn)
+        .build();
+
+    init_with_tracer_provider(log_name, tracer_provider);
+}
+
+/// Fallback when opentelemetry feature is not enabled.
+#[cfg(not(feature = "opentelemetry"))]
+pub fn init_with_opentelemetry(log_name: &str, _otlp_endpoint: Option<&str>) {
+    eprintln!(
+        "OTLP export requires the 'opentelemetry' feature to be enabled! Falling back to default tracing initialization."
+    );
+    crate::tracing::init(log_name);
+}
+
+/// Guard that flushes Chrome trace file when dropped.
+///
+/// Store this guard in a variable that lives for the duration of your program.
+/// When it's dropped, the trace file will be completed and closed.
+pub type ChromeTraceGuard = tracing_chrome::FlushGuard;
+
+/// Builds a Chrome trace layer and guard.
+///
+/// Returns a subscriber and guard. The subscriber should be used with `with_default`
+/// to avoid global state conflicts.
+pub fn build_chrome_trace_layer_with_exporter<W>(
+    log_name: &str,
+    writer: W,
+) -> (impl tracing::Subscriber + Send + Sync, ChromeTraceGuard)
+where
+    W: std::io::Write + Send + 'static,
+{
+    let (chrome_layer, guard) = ChromeLayerBuilder::new().writer(writer).build();
+
+    let config = crate::tracing::get_env_config(log_name);
+    let maybe_log_file_layer = config.maybe_log_file_layer();
+    let stderr_layer = config.stderr_layer();
+
+    let subscriber = tracing_subscriber::registry()
+        .with(chrome_layer)
+        .with(config.env_filter)
+        .with(maybe_log_file_layer)
+        .with(stderr_layer);
+
+    (subscriber, guard)
+}
+
+/// Initializes tracing with Chrome Trace JSON exporter.
+///
+/// Returns a guard that must be kept alive for the duration of the program.
+/// When the guard is dropped, the trace data is flushed and completed.
+///
+/// Exports traces to Chrome Trace JSON format which can be visualized in:
+/// - Chrome: `chrome://tracing`
+/// - Perfetto UI: <https://ui.perfetto.dev>
+///
+/// Note: Uses `try_init()` to avoid panicking if a global subscriber is already set.
+/// In that case, tracing may not work as expected.
+pub fn init_with_chrome_trace_exporter<W>(log_name: &str, writer: W) -> ChromeTraceGuard
+where
+    W: std::io::Write + Send + 'static,
+{
+    let (subscriber, guard) = build_chrome_trace_layer_with_exporter(log_name, writer);
+    let _ = subscriber.try_init();
+    guard
+}
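
The decision made by `opentelemetry_skip_filter` depends only on span metadata, that is, on which field names a span declares. The following is a dependency-free model of that decision, offered as a sketch rather than as the commit's implementation: `SpanMeta` and `exported_to_opentelemetry` are hypothetical names standing in for `tracing::Metadata` and the closure passed to `filter_fn`.

```rust
/// Hypothetical, simplified stand-in for `tracing::Metadata`: it carries only
/// the two pieces of information the filter in this file reads.
struct SpanMeta<'a> {
    /// True for spans, false for events.
    is_span: bool,
    /// Names of the fields declared on the span or event.
    field_names: &'a [&'a str],
}

/// Models the filter's decision: only spans pass, and only if they do not
/// declare a field named `opentelemetry.skip`. Field values are not part of
/// the metadata, so they cannot influence the result.
fn exported_to_opentelemetry(meta: &SpanMeta<'_>) -> bool {
    meta.is_span && !meta.field_names.contains(&"opentelemetry.skip")
}
```

One consequence, consistent with the doc comment's "regardless of the field's value" note: in this model a span declared with `opentelemetry.skip = false`, or with a runtime expression such as `should_skip`, is filtered out just like one declared with `true`, because only the field name is visible to the filter.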
