-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Make sending traces to Tempo less confusing #4771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make sending traces to Tempo less confusing #4771
Conversation
49d1ce7 to
cc17124
Compare
cc17124 to
3239148
Compare
6598199 to
a96c5e8
Compare
linera-base/src/tracing_otel.rs
Outdated
|
|
||
| let telemetry_only_filter = | ||
| filter_fn(|metadata| metadata.is_span() && metadata.target() == "telemetry_only"); | ||
| let tempo_filter = filter_fn(|metadata| metadata.is_span() && metadata.target() != "no_tempo"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface here is OpenTelemetry; Tempo is our OpenTelemetry consumer, provided by the infra, so it shouldn't be named in the codebase. Some validators might prefer to use a different OpenTelemetry aggregator for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I guess this is not Tempo specific at all
linera-base/src/tracing_otel.rs
Outdated
|
|
||
| let telemetry_only_filter = | ||
| filter_fn(|metadata| metadata.is_span() && metadata.target() == "telemetry_only"); | ||
| let tempo_filter = filter_fn(|metadata| metadata.is_span() && metadata.target() != "no_tempo"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The target is useful to have in events. Rather than a custom target, could we just standardize a field (e.g. "opentelemetry/disable": true)?
a96c5e8 to
9e69778
Compare
linera-base/src/tracing_otel.rs
Outdated
| /// Fallback when tempo feature is not enabled. | ||
| #[cfg(not(feature = "tempo"))] | ||
| /// Fallback when otel feature is not enabled. | ||
| #[cfg(not(feature = "otel"))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I don't love the otel abbreviation. We mostly try to avoid abbreviations as a style policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me :) (modulo formatting!)
linera-base/src/tracing_otel.rs
Outdated
| metadata | ||
| .fields() | ||
| .field("otel.skip") | ||
| .is_none() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it would be nice to require the value be true in order to disable; this makes it easier to do more complex conditionals, like
#[tracing::instrument(fields(otel.skip = !otel))]
fn foo(otel: bool) {
…
}
linera-base/src/tracing_otel.rs
Outdated
| metadata | ||
| .fields() | ||
| .field("otel.skip") | ||
| .is_none() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: to avoid the double negative, we could alternatively name it otel.emit, and skip only when it's false (i.e. the default value if not provided is considered to be true).
ef9cc62 to
0c1cadf
Compare
0c1cadf to
481c260
Compare
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.
Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.
By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:
```rust
// This span will NOT be sent to Tempo
fn internal_helper() { }
// This span WILL be sent to Tempo (default behavior)
fn public_api() { }
```
**Key Changes:**
**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter
**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty
string as the OTLP endpoint would cause the exporter to fail with
"Failed to parse endpoint". Now properly validates that the endpoint is
non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.
**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes
- These changes should be backported to the latest `testnet` branch,
then
- be released in a validator hotfix.
Might be good to backport just so if we want to add more instrumentation
on the testnet, things are not as confusing :)
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.
Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.
By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:
```rust
// This span will NOT be sent to Tempo
fn internal_helper() { }
// This span WILL be sent to Tempo (default behavior)
fn public_api() { }
```
**Key Changes:**
**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter
**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty
string as the OTLP endpoint would cause the exporter to fail with
"Failed to parse endpoint". Now properly validates that the endpoint is
non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.
**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes
- These changes should be backported to the latest `testnet` branch,
then
- be released in a validator hotfix.
Might be good to backport just so if we want to add more instrumentation
on the testnet, things are not as confusing :)
## Motivation Backport a few PRs: - **Make sending traces to Tempo less confusing (#4771)** - **Add otlp_exporter_endpoint option to proxy/server as well (#4829)** - **Stop using init and entrypoint scripts (#4830)** - **Use release_max_level_trace instead (#4554)** - **Make all heap profile logs trace (#4845)** - **exporter: Add retry logging to indexer connection (#4846)** - **Fix notification forwarding to avoid duplicate messages and handle lag (#4848)** - **Fix a few explorer frontend bugs (#4849)** - **Add support for exporter/indexer/explorer stack deployment (#4850)** - **Stop getting service monitor CRD from GitHub (#4855)** - **Fix small bug on exporter config (#4870)** - **Make sure that the Scylla config ConfigMap exists when Scylla gets created (#4871)** - **Remove UNIQUE constraints – those should be tracked by the node anyway (#4852)** ## Proposal Backport ## Test Plan CI --------- Co-authored-by: deuszx <95355183+deuszx@users.noreply.github.com>

Motivation
Right now we have a confusing mix of features and naming around telemetry export. The code references both
otelandtelemetry_only, and it's unclear how to opt in or out of sending trace information to Tempo/OpenTelemetry backends. Additionally, there are a few bugs preventing OpenTelemetry from working correctly in production deployments.Proposal
Instead of
telemetry_only, useopentelemetry.skipas a span field to opt out of sending specific spans to Tempo/OpenTelemetry.By default, all spans are sent to OpenTelemetry/Tempo backends (when the
opentelemetryfeature is enabled), regardless of log level. To exclude a span from OpenTelemetry export while keeping it in local traces (Chrome trace, logs), addopentelemetry.skipto the span fields:Key Changes:
Naming & Clarity:
otelfeature toopentelemetryfor clarity--otel-exporter-otlp-endpointCLI flag to--otlp-exporter-endpoint--otel-trace-fileCLI flag to--chrome-trace-file(more accurate)LINERA_OTLP_EXPORTER_ENDPOINTopentelemetry.skipfilterBug Fixes:
Test Improvements:
with_testing) for test-only functionsopentelemetry.skipare filtered from OpenTelemetry export but still appear in Chrome tracesTest Plan
opentelemetry.skipare filtered from OpenTelemetry export but still appear in Chrome tracesRelease Plan
testnetbranch, thenMight be good to backport just so if we want to add more instrumentation on the testnet, things are not as confusing :)