Skip to content

Conversation

@ndr-ds
Copy link
Contributor

@ndr-ds ndr-ds commented Oct 9, 2025

Motivation

Right now we have a confusing mix of features and naming around telemetry export. The code references both otel and telemetry_only, and it's unclear how to opt in or out of sending trace information to Tempo/OpenTelemetry backends. Additionally, there are a few bugs preventing OpenTelemetry from working correctly in production deployments.

Proposal

Instead of telemetry_only, use opentelemetry.skip as a span field to opt out of sending specific spans to Tempo/OpenTelemetry.

By default, all spans are sent to OpenTelemetry/Tempo backends (when the opentelemetry feature is enabled), regardless of log level. To exclude a span from OpenTelemetry export while keeping it in local traces (Chrome trace, logs), add opentelemetry.skip to the span fields:

// This span will NOT be sent to Tempo
#[instrument(fields(opentelemetry.skip = true))]
fn internal_helper() { }

// This span WILL be sent to Tempo (default behavior)  
#[instrument]
fn public_api() { }

Key Changes:

Naming & Clarity:

  • Renamed otel feature to opentelemetry for clarity
  • Renamed --otel-exporter-otlp-endpoint CLI flag to --otlp-exporter-endpoint
  • Renamed --otel-trace-file CLI flag to --chrome-trace-file (more accurate)
  • Standardized all environment variables to use LINERA_OTLP_EXPORTER_ENDPOINT
  • Chrome trace export includes ALL spans (useful for local debugging)
  • OpenTelemetry/Tempo export respects the opentelemetry.skip filter

Bug Fixes:

  • Fixed empty endpoint validation: Previously, passing an empty string as the OTLP endpoint would cause the exporter to fail with "Failed to parse endpoint". Now properly validates that the endpoint is non-empty before attempting to create the exporter.
  • Fixed faucet crash with OpenTelemetry: The OTLP exporter was being initialized before the Tokio runtime was created, causing a panic: "there is no reactor running, must be called from the context of a Tokio 1.x runtime". Moved tracing initialization to after runtime creation.

Test Improvements:

  • Added proper feature guards (with_testing) for test-only functions
  • CI tests verify that spans with opentelemetry.skip are filtered from OpenTelemetry export but still appear in Chrome traces

Test Plan

  • CI tests verify that spans with opentelemetry.skip are filtered from OpenTelemetry export but still appear in Chrome traces
  • All existing tests pass with the renamed feature and CLI flags
  • Manual testing confirms OTLP traces now successfully export to Tempo without crashes

Release Plan

  • These changes should be backported to the latest testnet branch, then
    • be released in a validator hotfix.

Might be good to backport just so if we want to add more instrumentation on the testnet, things are not as confusing :)

Copy link
Contributor Author

ndr-ds commented Oct 9, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@ndr-ds ndr-ds force-pushed the 10-09-make_sending_traces_to_tempo_less_confusing branch from 49d1ce7 to cc17124 Compare October 9, 2025 18:43
@ndr-ds ndr-ds force-pushed the 10-09-make_sending_traces_to_tempo_less_confusing branch from cc17124 to 3239148 Compare October 9, 2025 18:52
@ndr-ds ndr-ds requested review from Twey, afck, deuszx and ma2bd October 9, 2025 20:47
@ndr-ds ndr-ds marked this pull request as ready for review October 9, 2025 20:47
@ndr-ds ndr-ds force-pushed the 10-09-make_sending_traces_to_tempo_less_confusing branch 2 times, most recently from 6598199 to a96c5e8 Compare October 15, 2025 13:58

let telemetry_only_filter =
filter_fn(|metadata| metadata.is_span() && metadata.target() == "telemetry_only");
let tempo_filter = filter_fn(|metadata| metadata.is_span() && metadata.target() != "no_tempo");
Copy link
Contributor

@Twey Twey Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface here is OpenTelemetry; Tempo is our OpenTelemetry consumer, provided by the infra, so it shouldn't be named in the codebase. Some validators might prefer to use a different OpenTelemetry aggregator for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I guess this is not Tempo specific at all


let telemetry_only_filter =
filter_fn(|metadata| metadata.is_span() && metadata.target() == "telemetry_only");
let tempo_filter = filter_fn(|metadata| metadata.is_span() && metadata.target() != "no_tempo");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The target is useful to have in events. Rather than a custom target, could we just standardize a field (e.g. "opentelemetry/disable": true)?

@ndr-ds ndr-ds force-pushed the 10-09-make_sending_traces_to_tempo_less_confusing branch from a96c5e8 to 9e69778 Compare October 15, 2025 19:23
@ndr-ds ndr-ds requested a review from Twey October 15, 2025 19:24
/// Fallback when tempo feature is not enabled.
#[cfg(not(feature = "tempo"))]
/// Fallback when otel feature is not enabled.
#[cfg(not(feature = "otel"))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't love the otel abbreviation. We mostly try to avoid abbreviations as a style policy.

Copy link
Contributor

@Twey Twey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me :) (modulo formatting!)

metadata
.fields()
.field("otel.skip")
.is_none()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it would be nice to require the value be true in order to disable; this makes it easier to do more complex conditionals, like

#[tracing::instrument(fields(otel.skip = !otel))]
fn foo(otel: bool) {}

metadata
.fields()
.field("otel.skip")
.is_none()
Copy link
Contributor

@Twey Twey Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: to avoid the double negative, we could alternatively name it otel.emit, and skip only when it's false (i.e. the default value if not provided is considered to be true).

@ndr-ds ndr-ds force-pushed the 10-09-make_sending_traces_to_tempo_less_confusing branch 10 times, most recently from ef9cc62 to 0c1cadf Compare October 20, 2025 20:15
@ndr-ds ndr-ds force-pushed the 10-09-make_sending_traces_to_tempo_less_confusing branch from 0c1cadf to 481c260 Compare October 20, 2025 20:31
@ndr-ds ndr-ds added this pull request to the merge queue Oct 20, 2025
Merged via the queue into main with commit 0ec9a84 Oct 20, 2025
35 checks passed
@ndr-ds ndr-ds deleted the 10-09-make_sending_traces_to_tempo_less_confusing branch October 20, 2025 21:48
ndr-ds added a commit that referenced this pull request Oct 30, 2025
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.

Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.

By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:

```rust
// This span will NOT be sent to Tempo
fn internal_helper() { }

// This span WILL be sent to Tempo (default behavior)
fn public_api() { }
```

**Key Changes:**

**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter

**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty
string as the OTLP endpoint would cause the exporter to fail with
"Failed to parse endpoint". Now properly validates that the endpoint is
non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.

**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces

- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes

- These changes should be backported to the latest `testnet` branch,
then
    - be released in a validator hotfix.

Might be good to backport just so if we want to add more instrumentation
on the testnet, things are not as confusing :)
ndr-ds added a commit that referenced this pull request Oct 30, 2025
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.

Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.

By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:

```rust
// This span will NOT be sent to Tempo
fn internal_helper() { }

// This span WILL be sent to Tempo (default behavior)
fn public_api() { }
```

**Key Changes:**

**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter

**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty
string as the OTLP endpoint would cause the exporter to fail with
"Failed to parse endpoint". Now properly validates that the endpoint is
non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.

**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces

- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes

- These changes should be backported to the latest `testnet` branch,
then
    - be released in a validator hotfix.

Might be good to backport just so if we want to add more instrumentation
on the testnet, things are not as confusing :)
ndr-ds added a commit that referenced this pull request Oct 30, 2025
## Motivation
Backport a few PRs:

- **Make sending traces to Tempo less confusing (#4771)**
- **Add otlp_exporter_endpoint option to proxy/server as well (#4829)**
- **Stop using init and entrypoint scripts (#4830)**
- **Use release_max_level_trace instead (#4554)**
- **Make all heap profile logs trace (#4845)**
- **exporter: Add retry logging to indexer connection (#4846)**
- **Fix notification forwarding to avoid duplicate messages and handle
lag (#4848)**
- **Fix a few explorer frontend bugs (#4849)**
- **Add support for exporter/indexer/explorer stack deployment (#4850)**
- **Stop getting service monitor CRD from GitHub (#4855)**
- **Fix small bug on exporter config (#4870)**
- **Make sure that the Scylla config ConfigMap exists when Scylla gets
created (#4871)**
- **Remove UNIQUE constraints – those should be tracked by the node
anyway (#4852)**

## Proposal

Backport

## Test Plan

CI

---------

Co-authored-by: deuszx <95355183+deuszx@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants