DataFusion Tracing is an extension for Apache DataFusion that helps you monitor and debug queries. It uses tracing
and OpenTelemetry to gather DataFusion metrics, trace execution steps, and preview partial query results.
Note: This is not an official Apache Software Foundation release.
When you run queries with DataFusion Tracing enabled, it automatically:

- adds tracing spans around execution steps,
- records all native DataFusion metrics such as execution time and output row count,
- lets you preview partial results for easier debugging, and
- integrates with OpenTelemetry for distributed tracing.

This makes it simpler to understand and improve query performance.
Screenshots of what DataFusion Tracing can look like in practice are included in the docs directory.
Include DataFusion Tracing in your project's `Cargo.toml`:
```toml
[dependencies]
datafusion = "50.0.0"
datafusion-tracing = "50.0.2"
```
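The quick-start example below also relies on `tokio`, `tracing`, and a subscriber implementation. If they are not already in your project, a typical addition might look like this (the versions shown are illustrative, not requirements of DataFusion Tracing):

```toml
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
tracing = "0.1"
tracing-subscriber = "0.3"
```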
The ellipsis truncation indicator in `pretty_format_compact_batch` is disabled in this version because it requires `comfy-table >= 7.1.4`, while Apache Arrow currently pins `comfy-table` to `7.1.2` to preserve its MSRV. Context: `comfy-table` 7.2.0 bumped its MSRV to Rust 1.85 while Arrow remains at 1.84. See arrow-rs issue #8243 and PR #8244. Arrow used an exact pin rather than `~7.1`, which would also preserve the MSRV while allowing 7.1.x releases (including 7.1.4). We will re-enable the indicator once Arrow relaxes the pin to allow `>= 7.1.4`.
```rust
use datafusion::{
    arrow::{array::RecordBatch, util::pretty::pretty_format_batches},
    error::Result,
    execution::SessionStateBuilder,
    prelude::*,
};
use datafusion_tracing::{
    instrument_with_info_spans, pretty_format_compact_batch, InstrumentationOptions,
};
use std::sync::Arc;
use tracing::field;

#[tokio::main]
async fn main() -> Result<()> {
    // Initialize tracing subscriber as usual
    // (see examples/otlp.rs for a complete example).

    // Set up tracing options (you can customize these).
    let options = InstrumentationOptions::builder()
        .record_metrics(true)
        .preview_limit(5)
        .preview_fn(Arc::new(|batch: &RecordBatch| {
            pretty_format_compact_batch(batch, 64, 3, 10).map(|fmt| fmt.to_string())
        }))
        .add_custom_field("env", "production")
        .add_custom_field("region", "us-west")
        .build();

    let instrument_rule = instrument_with_info_spans!(
        options: options,
        env = field::Empty,
        region = field::Empty,
    );

    let session_state = SessionStateBuilder::new()
        .with_default_features()
        .with_physical_optimizer_rule(instrument_rule)
        .build();

    let ctx = SessionContext::new_with_state(session_state);

    let results = ctx.sql("SELECT 1").await?.collect().await?;
    println!(
        "Query Results:\n{}",
        pretty_format_batches(results.as_slice())?
    );

    Ok(())
}
```
A more complete example can be found in the examples directory.
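For quick local experimentation, if you only want spans printed to stdout rather than exported, a minimal subscriber setup with the `tracing-subscriber` crate might look like the sketch below, placed at the top of `main`; examples/otlp.rs shows the full OpenTelemetry wiring for exporting to a collector.

```rust
// Minimal sketch (assumes the tracing-subscriber crate): log spans and
// events to stdout instead of exporting them. Use the OTLP setup from
// examples/otlp.rs to send traces to a real collector.
tracing_subscriber::fmt()
    .with_max_level(tracing::Level::INFO)
    .init();
```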
Always register the instrumentation rule last in your physical optimizer chain.
- Many optimizer rules identify nodes using `as_any().downcast_ref::<ConcreteExec>()`. Since instrumentation wraps each node in a private `InstrumentedExec`, those downcasts won't match if instrumentation runs first, causing rules to be skipped or, in code that assumes success, to panic (see the sketch after this list).
- Some rules may rewrite parts of the plan after instrumentation. While `InstrumentedExec` re-wraps many common mutations, placing the rule last guarantees full, consistent coverage regardless of other rules' behaviors.
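As a purely illustrative sketch of the first point (the node type below is just an example, not a rule shipped by this crate): a rule that recognizes nodes by downcasting no longer matches once the node is wrapped, so it silently skips it.

```rust
use std::sync::Arc;

use datafusion::physical_plan::{coalesce_batches::CoalesceBatchesExec, ExecutionPlan};

/// Hypothetical check a downstream optimizer rule might perform.
/// It only "sees" the node when the downcast succeeds; if an
/// instrumentation wrapper sits in front of the node, this returns
/// false and the rule skips it (or panics if it unwraps the result).
fn is_coalesce_batches(plan: &Arc<dyn ExecutionPlan>) -> bool {
    plan.as_any()
        .downcast_ref::<CoalesceBatchesExec>()
        .is_some()
}
```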
Why is `InstrumentedExec` private?
- To prevent downstream code from downcasting to or unwrapping the wrapper, which would be brittle and force long-term compatibility constraints on its internals. The public contract is the optimizer rule, not the concrete node.
How to ensure it is last:
- When chaining:

  ```rust
  builder
      .with_physical_optimizer_rule(rule_a)
      .with_physical_optimizer_rule(rule_b)
      .with_physical_optimizer_rule(instrument_rule)
  ```

- Or collect:

  ```rust
  builder.with_physical_optimizer_rules(vec![..., instrument_rule])
  ```
Before diving into DataFusion Tracing, you'll need to set up an OpenTelemetry collector to receive and process the tracing data. There are several options available:
For local development and testing, Jaeger is a great choice. It's an open-source distributed tracing system that's easy to set up. You can run it with Docker using:
```shell
docker run --rm --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 5778:5778 \
  -p 9411:9411 \
  jaegertracing/jaeger:2.7.0
```
Once running, you can access the Jaeger UI at http://localhost:16686. For more details, check out their getting started guide.
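With Jaeger listening on port 4317, you then point your application's OTLP exporter at it. Many exporters honor the standard OpenTelemetry environment variables shown below (support varies by exporter and version, so check your exporter's documentation or set the endpoint in code when building it):

```shell
# Standard OpenTelemetry environment variables; exporter support varies,
# so configure the endpoint programmatically if yours does not read them.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_SERVICE_NAME=datafusion-tracing-demo
```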
For a cloud-native approach, DataDog offers a hosted solution for OpenTelemetry data. You can send your traces directly to their platform by configuring your DataDog API key and endpoint - their OpenTelemetry integration guide has all the details.
Of course, you can use any OpenTelemetry-compatible collector. The official OpenTelemetry Collector is a good starting point if you want to build a custom setup.
The repository is organized as follows:
- `datafusion-tracing/`: Core tracing functionality for DataFusion
- `instrumented-object-store/`: Object store instrumentation
- `integration-utils/`: Integration utilities and helpers for examples and tests (not for production use)
- `examples/`: Example applications demonstrating the library usage
- `tests/`: Integration tests
- `docs/`: Documentation, including logos and screenshots
Use these commands to build and test:
```shell
cargo build --workspace
cargo test --workspace
```
Integration tests and examples expect TPCH tables in Parquet format to be present in `integration-utils/data` (not checked in). Generate them locally with:
```shell
cargo install tpchgen-cli
./dev/generate_tpch_parquet.sh
```
This produces all TPCH tables at scale factor 0.1 as single Parquet files in `integration-utils/data`. CI installs `tpchgen-cli` and runs the same script automatically before tests. If a required file is missing, the helper library will return a clear error instructing you to run the script.
Contributions are welcome. Make sure your code passes all tests, follows the existing formatting and coding style, and includes tests and documentation. See CONTRIBUTING.md for detailed guidelines.
Licensed under the Apache License, Version 2.0. See LICENSE.
This project includes software developed at Datadog (info@datadoghq.com).